Preface



These notes grew from an introduction to probability theory taught during 
the first and second term of 1994 at Caltech. There was a mixed audience of 
undergraduates and graduate students in the first half of the course which 
covered Chapters 2 and 3, and mostly graduate students in the second part 
which covered Chapter 4 and two sections of Chapter 5. 

Having been online for many years on my personal web sites, the text got
reviewed, corrected and indexed in the summer of 2006. It obtained some
enhancements which benefited from other teaching notes and research
I wrote while teaching probability theory at the University of Arizona in
Tucson or when incorporating probability in calculus courses at Caltech
and Harvard University.

Most of Chapter 2 is standard material and subject of virtually any course
on probability theory. Chapters 3 and 4 are also well covered by the
literature, but not in this combination.

The last chapter "selected topics" got considerably extended in the summer 
of 2006. While in the original course, only localization and percolation prob- 
lems were included, I added other topics like estimation theory, Vlasov dy- 
namics, multi-dimensional moment problems, random maps, circle-valued 
random variables, the geometry of numbers, Diophantine equations and 
harmonic analysis. Some of this material is related to research I got inter- 
ested in over time. 

While the text assumes no prerequisites in probability, a basic exposure to 
calculus and linear algebra is necessary. Some real analysis as well as some 
background in topology and functional analysis can be helpful. 

I would like to get feedback from readers. I plan to keep this text alive and
update it in the future. You can send feedback by email to
knill@math.harvard.edu. Please indicate in the email if you don't want your
feedback to be acknowledged in an eventual future edition of these notes.
To get a more detailed and analytic exposure to probability, the students
of the original course consulted the book [109], which contains much
more material than was covered in class. Since the course was taught,
many other books have appeared. Examples are [21, 35].

For a less analytic approach, see [41, 95, 101] or the still excellent classic 
[26]. For an introduction to martingales, we recommend [113] and [48] from 
both of which these notes have benefited a lot and to which the students 
of the original course had access too. 

For Brownian motion, we refer to [75, 68], for stochastic processes to [17],
for stochastic differential equations to [2, 56, 78, 68, 47], for random walks
to [104], for Markov chains to [27, 91], for entropy and Markov operators
to [63]. For applications in physics and chemistry, see [111].

For the selected topics, we followed [33] in the percolation section. The
books [105, 31] contain introductions to Vlasov dynamics. The book [1]
gives an introduction to the moment problem, [77, 66] to circle-valued
random variables. For Poisson processes, see [50, 9]. For the geometry of
numbers for Fourier series on fractals [46].

The book [114] contains examples which challenge the theory with
counterexamples. [34, 96, 72] are sources for problems with solutions.

Probability theory can be developed using nonstandard analysis on finite 
probability spaces [76]. The book [43] breaks some of the material of the 
first chapter into attractive stories. Also texts like [93, 80] are not only for 
mathematical tourists. 

We live in a time in which more and more content is available online.
Knowledge diffuses from papers and books to online websites and databases
which also ease the digging for knowledge in the fascinating field of
probability theory.

Oliver Knill, March 20, 2008 
Chapter 1

Introduction 



1.1 What is probability theory? 

Probability theory is a fundamental pillar of modern mathematics with
relations to other mathematical areas like algebra, topology, analysis,
geometry or dynamical systems. As with any fundamental mathematical
construction, the theory starts by adding more structure to a set $\Omega$.
In a similar way as introducing algebraic operations, a topology, or a time
evolution on a set, probability theory adds a measure theoretical structure
to $\Omega$ which generalizes "counting" on finite sets: in order to measure
the probability of a subset $A \subset \Omega$, one singles out a class of
subsets $\mathcal{A}$, on which one can hope to do so. This leads to the
notion of a $\sigma$-algebra $\mathcal{A}$. It is a set of subsets of
$\Omega$ in which one can perform finitely or countably many operations
like taking unions, complements or intersections. The elements in
$\mathcal{A}$ are called events. If a point $\omega$ in the "laboratory"
$\Omega$ denotes an "experiment", an "event" $A \in \mathcal{A}$ is a subset
of $\Omega$, for which one can assign a probability $P[A] \in [0,1]$. For
example, if $P[A] = 1/3$, the event happens with probability 1/3. If
$P[A] = 1$, the event takes place almost certainly. The probability measure
$P$ has to satisfy obvious properties like that the union $A \cup B$ of two
disjoint events $A, B$ satisfies $P[A \cup B] = P[A] + P[B]$ or that the
complement $A^c$ of an event $A$ has the probability $P[A^c] = 1 - P[A]$.
With a probability space $(\Omega, \mathcal{A}, P)$ alone, there is already
some interesting mathematics: one has for example the combinatorial problem
to find the probabilities of events like the event to get a "royal flush" in
poker. If $\Omega$ is a subset of a Euclidean space like the plane and
$P[A] = \int_A f(x,y) \, dx \, dy$ for a suitable nonnegative function $f$,
we are led to integration problems in calculus. Actually, in many
applications, the probability space is part of Euclidean space and the
$\sigma$-algebra is the smallest which contains all open sets. It is called
the Borel $\sigma$-algebra. An important example is the Borel
$\sigma$-algebra on the real line.

Given a probability space $(\Omega, \mathcal{A}, P)$, one can define random
variables $X$. A random variable is a function $X$ from $\Omega$ to the real
line $\mathbb{R}$ which is measurable in the sense that the inverse image of
a Borel set $B$ in $\mathbb{R}$ is in $\mathcal{A}$. The interpretation is
that if $\omega$ is an experiment, then $X(\omega)$ measures an observable
quantity of the experiment. The technical condition of measurability
resembles the notion of continuity for a function $f$ from a topological
space $(\Omega, \mathcal{O})$ to the topological space
$(\mathbb{R}, \mathcal{U})$. A function is continuous if
$f^{-1}(U) \in \mathcal{O}$ for all open sets $U \in \mathcal{U}$. In
probability theory, where functions are often denoted with capital letters,
like $X, Y, \dots$, a random variable $X$ is measurable if
$X^{-1}(B) \in \mathcal{A}$ for all Borel sets $B \in \mathcal{B}$. Any
continuous function is measurable for the Borel $\sigma$-algebra. As in
calculus, where one does not have to worry about continuity most of the
time, also in probability theory, one often does not have to sweat about
measurability issues. Indeed, one could suspect that notions like
$\sigma$-algebras or measurability were introduced by mathematicians to
scare normal folks away from their realms. This is not the case. Serious
issues are avoided with those constructions. Mathematics is eternal: a once
established result will be true also in thousands of years. A theory in
which one could prove a theorem as well as its negation would be worthless:
it would formally allow one to prove any other result, whether true or
false. So, these notions are not only introduced to keep the theory "clean",
they are essential for the "survival" of the theory. We give some examples
of "paradoxes" to illustrate the need for building a careful theory. Back to
the fundamental notion of random variables: because they are just functions,
one can add and multiply them by defining
$(X + Y)(\omega) = X(\omega) + Y(\omega)$ or
$(XY)(\omega) = X(\omega) Y(\omega)$. Random variables thus form an algebra
$\mathcal{L}$. The expectation of a random variable $X$ is denoted by $E[X]$
if it exists. It is a real number which indicates the "mean" or "average" of
the observation $X$. It is the value one would expect to measure in the
experiment. If $X = 1_B$ is the random variable which has the value 1 if
$\omega$ is in the event $B$ and 0 if $\omega$ is not in the event $B$, then
the expectation of $X$ is just the probability of $B$. The constant random
variable $X(\omega) = a$ has the expectation $E[X] = a$. These two basic
examples as well as the linearity requirement
$E[aX + bY] = aE[X] + bE[Y]$ determine the expectation for all random
variables in the algebra $\mathcal{L}$: first one defines expectation for
finite sums $\sum_{i=1}^{n} a_i 1_{B_i}$ called elementary random variables,
which approximate general measurable functions. Extending the expectation to
a subset $\mathcal{L}^1$ of the entire algebra is part of integration
theory. While in calculus, one can live with the Riemann integral on the
real line, which defines the integral by Riemann sums
$\int_a^b f(x) \, dx \sim \frac{1}{n} \sum_{i/n \in [a,b]} f(i/n)$, the
integral defined in measure theory is the Lebesgue integral. The latter is
more fundamental and probability theory is a major motivator for using it.
It allows one to make statements like that the set of real numbers with
periodic decimal expansion has probability 0. In general, the probability of
$A$ is the expectation of the random variable $X(x) = f(x) = 1_A(x)$. In
calculus, the integral $\int f(x) \, dx$ would not be defined because a
Riemann integral can give 1 or 0 depending on how the Riemann approximation
is done. Probability theory allows one to introduce the Lebesgue integral by
defining $\int_a^b f(x) \, dx$ as the limit of
$\frac{b-a}{n} \sum_{i=1}^{n} f(x_i)$ for $n \to \infty$, where $x_i$ are
random uniformly distributed points in the interval $[a,b]$. This Monte
Carlo definition of the Lebesgue integral is based on the law of large
numbers and is as intuitive to state as the Riemann integral which is the
limit of $\frac{1}{n} \sum_{x_j = j/n \in [a,b]} f(x_j)$ for $n \to \infty$.
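The Monte Carlo description of the integral can be tried out numerically. The following is only a minimal sketch: the function names `monte_carlo_integral` and `riemann_sum`, the fixed seed and the test function $x^2$ are illustrative assumptions, not part of the original notes. The sample average is multiplied by the interval length $b - a$, which equals 1 on the unit interval.

```python
import random

def monte_carlo_integral(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] from n uniform random points."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n         # interval length times the sample average

def riemann_sum(f, a, b, n):
    """Left-endpoint Riemann sum of f over [a, b] with n equal steps."""
    h = (b - a) / n
    return h * sum(f(a + j * h) for j in range(n))

# Hypothetical test function: x^2 on [0, 1], whose integral is 1/3.
mc_estimate = monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 100_000)
rs_estimate = riemann_sum(lambda x: x * x, 0.0, 1.0, 1000)
print(mc_estimate, rs_estimate)
```

Both numbers should be close to 1/3. The random estimate fluctuates on the order of $1/\sqrt{n}$, while the Riemann sum needs a function regular enough for the grid values to be representative.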

With the fundamental notion of expectation one can define the variance,
$\mathrm{Var}[X] = E[X^2] - E[X]^2$ and the standard deviation
$\sigma[X] = \sqrt{\mathrm{Var}[X]}$ of a random variable $X$ for which
$X^2 \in \mathcal{L}^1$. One can also look at the covariance
$\mathrm{Cov}[X,Y] = E[XY] - E[X]E[Y]$ of two random variables $X, Y$ for
which $X^2, Y^2 \in \mathcal{L}^1$. The correlation
$\mathrm{Corr}[X,Y] = \mathrm{Cov}[X,Y] / (\sigma[X] \sigma[Y])$ of two
random variables with positive variance is a number which tells how much
the random variable $X$ is related to the random variable $Y$. If $E[XY]$ is
interpreted as an inner product, then the standard deviation is the length
of $X - E[X]$ and the correlation has the geometric interpretation as
$\cos(\alpha)$, where $\alpha$ is the angle between the centered random
variables $X - E[X]$ and $Y - E[Y]$. For example, if
$\mathrm{Corr}[X,Y] = 1$, then $Y - E[Y] = \lambda (X - E[X])$ for some
$\lambda > 0$, if $\mathrm{Corr}[X,Y] = -1$, they are anti-parallel. If the
correlation is zero, the geometric interpretation is that the two random
variables are perpendicular. Uncorrelated random variables still can have
relations to each other but if for any measurable real functions $f$ and
$g$, the random variables $f(X)$ and $g(Y)$ are uncorrelated, then the
random variables $X, Y$ are independent.
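A small numerical illustration of covariance and correlation on a finite probability space with equally likely points may help here. It is a sketch only: the helper names `mean`, `covariance` and `correlation` and the sample data are assumptions chosen for illustration. The last example shows a pair which is uncorrelated although $Y$ is a function of $X$.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Cov[X,Y] = E[XY] - E[X]E[Y] for equally likely sample points."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Corr[X,Y] = Cov[X,Y] / (sigma[X] sigma[Y]), the cosine of the angle."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
parallel = correlation(xs, [2 * x + 1 for x in xs])       # centered Y is a positive multiple of centered X
antiparallel = correlation(xs, [-3 * x for x in xs])      # centered Y is a negative multiple
# X uniform on {-1, 0, 1} and Y = X^2: uncorrelated, but clearly dependent.
perpendicular = correlation([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
print(parallel, antiparallel, perpendicular)
```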

A random variable $X$ can be described well by its distribution function
$F_X$. This is a real-valued function defined as $F_X(s) = P[X \le s]$ on
$\mathbb{R}$, where $\{X \le s\}$ is the event of all experiments $\omega$
satisfying $X(\omega) \le s$. The distribution function does not encode the
internal structure of the random variable $X$; it does not reveal the
structure of the probability space for example. But the function $F_X$
allows the construction of a probability space with exactly this
distribution function. There are two important types of distributions,
continuous distributions with a probability density function $f_X = F_X'$
and discrete distributions for which $F$ is piecewise constant. An example
of a continuous distribution is the standard normal distribution, where
$f_X(x) = e^{-x^2/2} / \sqrt{2\pi}$. One can characterize it as the
distribution with maximal entropy $I(f) = -\int \log(f(x)) f(x) \, dx$ among
all distributions which have zero mean and variance 1. An example of a
discrete distribution is the Poisson distribution
$P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}$ on
$\mathbb{N} = \{0, 1, 2, \dots\}$. One can describe random variables by
their moment generating functions $M_X(t) = E[e^{Xt}]$ or by their
characteristic function $\phi_X(t) = E[e^{iXt}]$. The latter is the Fourier
transform of the law $\mu_X = F_X'$ which is a measure on the real line
$\mathbb{R}$.

The law $\mu_X$ of the random variable is a probability measure on the real
line satisfying $\mu_X((a,b]) = F_X(b) - F_X(a)$. By the Lebesgue
decomposition theorem, one can decompose any measure $\mu$ into a discrete
part $\mu_{pp}$, an absolutely continuous part $\mu_{ac}$ and a singular
continuous part $\mu_{sc}$. Random variables $X$ for which $\mu_X$ is a
discrete measure are called discrete random variables, random variables with
a continuous law are called continuous random variables. Traditionally,
these two types of random variables are the most important ones. But
singular continuous random variables appear too: in spectral theory,
dynamical systems or fractal geometry. Of course, the law of a random
variable $X$ does not need to be pure. It can mix the three types. A random
variable can be mixed discrete and continuous for example.

Inequalities play an important role in probability theory. The Chebychev
inequality $P[|X - E[X]| \ge c] \le \mathrm{Var}[X] / c^2$ is used very
often. It is a special case of the Chebychev-Markov inequality
$h(c) \cdot P[X \ge c] \le E[h(X)]$ for monotone nonnegative functions $h$.
Other inequalities are the Jensen inequality $E[h(X)] \ge h(E[X])$ for
convex functions $h$, the Minkowski inequality
$\|X + Y\|_p \le \|X\|_p + \|Y\|_p$ or the Hölder inequality
$\|XY\|_1 \le \|X\|_p \|Y\|_q$, $1/p + 1/q = 1$, for random variables
$X, Y$ for which $\|X\|_p = E[|X|^p]^{1/p}$ and
$\|Y\|_q = E[|Y|^q]^{1/q}$ are finite. Any inequality which appears in
analysis can be useful in the toolbox of probability theory.
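As a concrete check of the Chebychev inequality, one can compare both sides exactly for a fair die. This is a sketch under the assumption that $X$ is uniform on $\{1, \dots, 6\}$; the names `tail_probability` and `chebyshev_bound` are made up for the illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)                 # faces of a fair die (assumed example)
p = Fraction(1, 6)                     # probability of each face
mean = sum(p * x for x in outcomes)
variance = sum(p * (x - mean) ** 2 for x in outcomes)

def tail_probability(c):
    """Exact value of P[|X - E[X]| >= c]."""
    return sum(p for x in outcomes if abs(x - mean) >= c)

def chebyshev_bound(c):
    """Right hand side Var[X] / c^2 of the Chebychev inequality."""
    return variance / Fraction(c) ** 2

for c in (1, 2, 3):
    print(c, tail_probability(c), chebyshev_bound(c))
```

The bound is far from sharp in this example, which is typical: the inequality uses only the variance and no further information about the distribution.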

Independence is a central notion in probability theory. Two events $A, B$
are called independent, if $P[A \cap B] = P[A] \cdot P[B]$. An arbitrary set
of events $A_i$ is called independent, if for any finite subset of them, the
probability of their intersection is the product of their probabilities. Two
$\sigma$-algebras $\mathcal{A}, \mathcal{B}$ are called independent, if for
any pair $A \in \mathcal{A}, B \in \mathcal{B}$, the events $A, B$ are
independent. Two random variables $X, Y$ are independent, if they generate
independent $\sigma$-algebras. It is enough to check that the events
$A = \{X \in (a,b)\}$ and $B = \{Y \in (c,d)\}$ are independent for all
intervals $(a,b)$ and $(c,d)$. One should think of independent random
variables as two aspects of the laboratory $\Omega$ which do not influence
each other. Each event $A = \{a < X(\omega) < b\}$ is independent of the
event $B = \{c < Y(\omega) < d\}$. While the distribution function
$F_{X+Y}$ of the sum of two independent random variables is a convolution
$\int_{\mathbb{R}} F_X(t-s) \, dF_Y(s)$, the moment generating functions and
characteristic functions satisfy the formulas
$M_{X+Y}(t) = M_X(t) M_Y(t)$ and $\phi_{X+Y}(t) = \phi_X(t) \phi_Y(t)$.
These identities make $M_X, \phi_X$ valuable tools to compute the
distribution of an arbitrary finite sum of independent random variables.
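For discrete random variables, the convolution and the product formula for moment generating functions can be verified directly. The sketch below assumes two independent fair dice as an example; `convolve` and `mgf` are hypothetical helper names.

```python
import math
from fractions import Fraction

def convolve(p, q):
    """Law of X + Y for independent X, Y with laws p, q given as {value: probability}."""
    r = {}
    for x, px in p.items():
        for y, qy in q.items():
            r[x + y] = r.get(x + y, 0) + px * qy
    return r

def mgf(p, t):
    """Moment generating function M(t) = E[exp(tX)] of a discrete law p."""
    return sum(float(px) * math.exp(t * x) for x, px in p.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)

print(two_dice[7])                                # probability that the sum is 7
print(mgf(two_dice, 0.3), mgf(die, 0.3) ** 2)     # the two numbers should agree
```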

Independence can also be explained using conditional probability with
respect to an event $B$ of positive probability: the conditional probability
$P[A|B] = P[A \cap B] / P[B]$ of $A$ is the probability that $A$ happens
when we know that $B$ takes place. If $B$ is independent of $A$, then
$P[A|B] = P[A]$ but in general, the conditional probability differs from
$P[A]$. The notion of conditional probability leads to the important notion
of conditional expectation $E[X|\mathcal{B}]$ of a random variable $X$ with
respect to some sub-$\sigma$-algebra $\mathcal{B}$ of the $\sigma$-algebra
$\mathcal{A}$; it is a new random variable which is
$\mathcal{B}$-measurable. For $\mathcal{B} = \mathcal{A}$, it is the random
variable itself, for the trivial algebra
$\mathcal{B} = \{\emptyset, \Omega\}$, we obtain the usual expectation
$E[X] = E[X|\{\emptyset, \Omega\}]$. If $\mathcal{B}$ is generated by a
finite partition $B_1, \dots, B_n$ of $\Omega$ of pairwise disjoint sets
covering $\Omega$, then $E[X|\mathcal{B}]$ is piecewise constant on the sets
$B_i$ and the value on $B_i$ is the average value of $X$ on $B_i$. If
$\mathcal{B}$ is the $\sigma$-algebra of an independent random variable $Y$,
then $E[X|Y] = E[X|\mathcal{B}] = E[X]$. In general, the conditional
expectation with respect to $\mathcal{B}$ is a new random variable obtained
by averaging on the elements of $\mathcal{B}$. One has $E[X|Y] = h(Y)$ for
some function $h$, extreme cases being $E[X|1] = E[X]$, $E[X|X] = X$. An
illustrative example is the situation where $X(x,y)$ is a continuous
function on the unit square with $P = dx \, dy$ as a probability measure and
where $Y(x,y) = x$. In that case, $E[X|Y]$ is a function of $x$ alone, given
by $E[X|Y](x) = \int_0^1 X(x,y) \, dy$. This is called a conditional
integral.
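The case of a finite partition can be computed by hand or with a few lines of code. The following sketch assumes a fair die with $X(\omega) = \omega$ and the partition into odd and even outcomes; the function name `conditional_expectation` is an illustrative choice.

```python
from fractions import Fraction

def conditional_expectation(prob, X, partition):
    """E[X|B] for the sigma-algebra generated by a finite partition.

    prob maps each point of the finite probability space to its probability,
    X is a function on the points, partition is a list of disjoint blocks.
    Returns a dict giving the (piecewise constant) value at every point.
    """
    cond = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        average = sum(X(w) * prob[w] for w in block) / p_block
        for w in block:
            cond[w] = average          # same value on the whole block
    return cond

prob = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die (assumed example)
X = lambda w: w
cond = conditional_expectation(prob, X, [{1, 3, 5}, {2, 4, 6}])

# Averaging the conditional expectation gives back the expectation of X.
total = sum(cond[w] * prob[w] for w in prob)
print(cond, total)
```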

A set $\{X_t\}_{t \in T}$ of random variables defines a stochastic process.
The variable $t \in T$ is a parameter called "time". Stochastic processes
are to probability theory what differential equations are to calculus. An
example is a family $X_n$ of random variables which evolve with discrete
time $n \in \mathbb{N}$. Deterministic dynamical system theory branches into
discrete time systems, the iteration of maps and continuous time systems,
the theory of ordinary and partial differential equations. Similarly, in
probability theory, one distinguishes between discrete time stochastic
processes and continuous time stochastic processes. A discrete time
stochastic process is a sequence of random variables $X_n$ with certain
properties. An important example is when $X_n$ are independent, identically
distributed random variables. A continuous time stochastic process is given
by a family of random variables $X_t$, where $t$ is real time. An example is
a solution of a stochastic differential equation. With more general time
like $\mathbb{Z}^d$ or $\mathbb{R}^d$ random variables are called random
fields which play a role in statistical physics. Examples of such processes
are percolation processes.

While one can realize every discrete time stochastic process $X_n$ by a
measure-preserving transformation $T : \Omega \to \Omega$ and
$X_n(\omega) = X(T^n(\omega))$, probability theory often focuses on a
special subclass of systems called martingales, where one has a filtration
$\mathcal{A}_n \subset \mathcal{A}_{n+1}$ of $\sigma$-algebras such that
$X_n$ is $\mathcal{A}_n$-measurable and
$E[X_n|\mathcal{A}_{n-1}] = X_{n-1}$, where $E[X_n|\mathcal{A}_{n-1}]$ is
the conditional expectation with respect to the sub-algebra
$\mathcal{A}_{n-1}$. Martingales are a powerful generalization of the random
walk, the process of summing up IID random variables with zero mean.
Similarly to ergodic theory, martingale theory is a natural extension of
probability theory and has many applications.

The language of probability fits well into the classical theory of
dynamical systems. For example, the ergodic theorem of Birkhoff for
measure-preserving transformations has as a special case the law of large
numbers which describes the average of partial sums of random variables
$\frac{1}{n} \sum_{k=1}^{n} X_k$. There are different versions of the law of
large numbers. "Weak laws" make statements about convergence in probability,
"strong laws" make statements about almost everywhere convergence. There are
versions of the law of large numbers for which the random variables do not
need to have a common distribution and which go beyond Birkhoff's theorem.
Another important theorem is the central limit theorem which shows that
$S_n = X_1 + X_2 + \cdots + X_n$ normalized to have zero mean and variance 1
converges in law to the normal distribution or the law of the iterated
logarithm which says that for centered independent and identically
distributed $X_k$, the scaled sum $S_n / \Lambda_n$ has accumulation points
in the interval $[-\sigma, \sigma]$ if
$\Lambda_n = \sqrt{2 n \log \log n}$ and $\sigma$ is the standard deviation
of $X_k$. While stating the weak and strong law of large numbers and the
central limit theorem, different convergence notions for random variables
appear: almost sure convergence is the strongest, it implies convergence in
probability and the latter implies convergence in law. There is also
$L^p$-convergence which is stronger than convergence in probability.
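Both limit theorems are easy to observe in a simulation. The sketch below uses IID random variables uniform on $[0,1]$ (mean $1/2$, variance $1/12$) as an assumed example; the names `sample_mean` and `normalized_sum`, the sample sizes and the seed are arbitrary choices.

```python
import math
import random

def sample_mean(n, rng):
    """Average of n IID uniform [0,1] variables; the law of large numbers gives 1/2."""
    return sum(rng.random() for _ in range(n)) / n

def normalized_sum(n, rng):
    """S_n shifted and scaled to have mean 0 and variance 1."""
    s = sum(rng.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12.0)

rng = random.Random(0)                 # fixed seed for reproducibility
lln_estimate = sample_mean(100_000, rng)

# For a standard normal variable Z, P[|Z| <= 1] is about 0.6827.
samples = [normalized_sum(30, rng) for _ in range(4000)]
fraction_within_one_sigma = sum(abs(z) <= 1 for z in samples) / len(samples)
print(lln_estimate, fraction_within_one_sigma)
```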

As in the deterministic case, where the theory of differential equations is
more technical than the theory of maps, building up the formalism for
continuous time stochastic processes $X_t$ is more elaborate. Similarly as
for differential equations, one has first to prove the existence of the
objects. The most important continuous time stochastic process definitely is
Brownian motion $B_t$. Standard Brownian motion is a stochastic process
which satisfies $B_0 = 0$, $E[B_t] = 0$, $\mathrm{Cov}[B_s, B_t] = s$ for
$s \le t$ and for any sequence of times
$0 = t_0 < t_1 < \cdots < t_i < t_{i+1}$, the increments
$B_{t_{i+1}} - B_{t_i}$ are all independent random vectors with normal
distribution. Brownian motion $B_t$ is a solution of the stochastic
differential equation $\frac{d}{dt} B_t = \zeta(t)$, where $\zeta(t)$ is
called white noise. Because white noise is only defined as a generalized
function and is not a stochastic process by itself, this stochastic
differential equation has to be understood in its integrated form
$B_t = \int_0^t dB_s = \int_0^t \zeta(s) \, ds$.

More generally, a solution to a stochastic differential equation
$\frac{d}{dt} X_t = f(X_t) \zeta(t) + g(X_t)$ is defined as the solution to
the integral equation
$X_t = X_0 + \int_0^t f(X_s) \, dB_s + \int_0^t g(X_s) \, ds$. Stochastic
differential equations can be defined in different ways. The expression
$\int_0^t f(X_s) \, dB_s$ can either be defined as an Ito integral, which
leads to martingale solutions, or the Stratonovich integral, which has
integration rules similar to those of classical differential equations.
Examples of stochastic differential equations are
$\frac{d}{dt} X_t = X_t \zeta(t)$ which has the solution
$X_t = e^{B_t - t/2}$. Or $\frac{d}{dt} X_t = B_t^4 \zeta(t)$ which has as
the solution the process $X_t = B_t^5 - 10 B_t^3 + 15 B_t$. The key tool to
solve stochastic differential equations is Ito's formula
$f(B_t) - f(B_0) = \int_0^t f'(B_s) \, dB_s + \frac{1}{2} \int_0^t f''(B_s) \, ds$,
which is the stochastic analog of the fundamental theorem of calculus.
Solutions to stochastic differential equations are examples of Markov
processes which show diffusion. Especially, the solutions can be used to
solve classical partial differential equations like the Dirichlet problem
$\Delta u = 0$ in a bounded domain $D$ with $u = f$ on the boundary
$\partial D$. One can get the solution by computing the expectation of $f$
at the end points of Brownian motion starting at $x$ and ending at the
boundary: $u(x) = E_x[f(B_T)]$, where $T$ is the time at which the path
first hits the boundary. On a discrete graph, if Brownian motion is replaced
by random walk, the same formula holds too. Stochastic calculus is also
useful to interpret quantum mechanics as a diffusion process [75, 73] or as
a tool to compute solutions to quantum mechanical problems using
Feynman-Kac formulas.
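For $f(x) = x^2$, Ito's formula reads $B_t^2 = 2 \int_0^t B_s \, dB_s + t$. A discretized path makes the extra term $t$ visible: it is the sum of the squared increments. The sketch below is an illustration under stated assumptions only; the step count, the seed and the name `brownian_path` are arbitrary, and the Ito integral is approximated by a sum which evaluates the integrand at the left end of each time step.

```python
import math
import random

def brownian_path(T, n, seed=0):
    """Approximate Brownian path on [0, T]: sum of n independent normal increments."""
    rng = random.Random(seed)
    dt = T / n
    path = [0.0]                                   # B_0 = 0
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

T, n = 1.0, 10_000
B = brownian_path(T, n)

# Left-endpoint sum approximating the Ito integral of B_s dB_s.
ito_integral = sum(B[i] * (B[i + 1] - B[i]) for i in range(n))
# Sum of squared increments; it concentrates near T for large n.
quadratic_variation = sum((B[i + 1] - B[i]) ** 2 for i in range(n))

print(B[-1] ** 2, 2 * ito_integral + quadratic_variation, quadratic_variation)
```

The first two printed numbers agree up to rounding because $b^2 - a^2 = 2a(b-a) + (b-a)^2$ holds for every step. In classical calculus the squared increments would vanish in the limit; here they add up to about $T$.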

Some features of stochastic processes can be described using the language
of Markov operators $P$, which are positive and expectation-preserving
transformations on $\mathcal{L}^1$. Examples of such operators are
Perron-Frobenius operators $X \to X(T)$ for a measure preserving
transformation $T$ defining a discrete time evolution or stochastic matrices
describing a random walk on a finite graph. Markov operators can be defined
by transition probability functions which are measure-valued random
variables. The interpretation is that from a given point $\omega$, there are
different possibilities to go to. A transition probability measure
$\mathcal{P}(\omega, \cdot)$ gives the distribution of the target. The
relation with Markov operators is assured by the Chapman-Kolmogorov equation
$P^{n+m} = P^n \circ P^m$. Markov processes can be obtained from random
transformations, random walks or by stochastic differential equations. In
the case of a finite or countable target space $S$, one obtains Markov
chains which can be described by probability matrices $P$, which are the
simplest Markov operators. For Markov operators, there is an arrow of time:
the relative entropy with respect to a background measure is
non-increasing. Markov processes often are attracted by fixed points of the
Markov operator. Such fixed points are called stationary states. They
describe equilibria and often they are measures with maximal entropy. An
example is the Markov operator $P$, which assigns to a probability density
$f_Y$ the probability density $f_{\overline{Y+X}}$, where $\overline{Y+X}$
is the random variable $Y + X$ normalized so that it has mean 0 and variance
1. For the initial function $f = 1$, the function $P^n(f_X)$ is the
distribution of $S_n^*$, the normalized sum of $n$ IID random variables
$X_i$. This Markov operator has a unique equilibrium point, the standard
normal distribution. It has maximal entropy among all distributions on the
real line with variance 1 and mean 0. The central limit theorem tells that
the Markov operator $P$ has the normal distribution as a unique attracting
fixed point if one takes the weaker topology of convergence in distribution
on $\mathcal{L}^1$. This works in other situations too. For circle-valued
random variables for example, the uniform distribution maximizes entropy. It
is not surprising therefore, that there is a central limit theorem for
circle-valued random variables with the uniform distribution as the limiting
distribution.
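In the simplest setting of a Markov chain with finitely many states, the Chapman-Kolmogorov equation and the attraction towards a stationary state can be checked with exact arithmetic. The two-state matrix below is a made-up example, and `mat_mul`, `mat_pow` and `evolve` are hypothetical helper names; distributions are treated as row vectors.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(A, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(A)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, A)
    return result

def evolve(dist, P, steps):
    """Apply the stochastic matrix P to a row distribution the given number of times."""
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return dist

# Assumed example: a two-state chain; every row sums to 1.
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
stationary = [Fraction(5, 6), Fraction(1, 6)]      # fixed point of this P

print(mat_pow(P, 5) == mat_mul(mat_pow(P, 2), mat_pow(P, 3)))    # Chapman-Kolmogorov
print([float(x) for x in evolve([Fraction(1), Fraction(0)], P, 60)])
```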

In the same way as mathematics reaches out into other scientific areas,
probability theory has connections with many other branches of mathematics.
The last chapter of these notes gives some examples. The section on
percolation shows how probability theory can help to understand critical
phenomena. In solid state physics, one considers operator-valued random
variables. The spectra of random operators are random objects too. One is
interested in what happens with probability one. Localization is the
phenomenon in solid state physics that sufficiently random operators often
have pure point spectrum. The section on estimation theory gives a glimpse
of what mathematical statistics is about. In statistics one often does not
know the probability space itself so that one has to make a statistical
model and look at a parameterization of probability spaces. The goal is to
give maximum likelihood estimates for the parameters from data and to
understand how small the quadratic estimation error can be made. A section
on Vlasov dynamics shows how probability theory appears in problems of
geometric evolution. Vlasov dynamics is a generalization of the n-body
problem to the evolution of probability measures. One can look at the
evolution of smooth measures or measures located on surfaces. This
deterministic stochastic system produces an evolution of densities which
can form singularities without doing harm to the formalism. It also defines
the evolution of surfaces. The section on moment problems is part of
multivariate statistics. As for random variables, random vectors can be
described by their moments. Since moments define the law of the random
variable, the question arises how one can see from the moments, whether we
have a continuous random variable. The section on random maps is another
part of dynamical systems theory. Randomized versions of diffeomorphisms
can be considered idealizations of their undisturbed versions. They often
can be understood better than their deterministic versions. For example,
many random diffeomorphisms have only finitely many ergodic components. In
the section on circular random variables, we see that the Mises
distribution has extremal entropy among all circle-valued random variables
with given circular mean and variance. There is also a central limit
theorem on the circle: the sum of IID circular random variables converges
in law to the uniform distribution. We then look at a problem in the
geometry of numbers: how many lattice points are there in a neighborhood of
the graph of one-dimensional Brownian motion? The analysis of this problem
needs a law of large numbers for independent random variables $X_k$ with
uniform distribution on $[0,1]$: for $0 < \delta < 1$ and
$A_n = [0, 1/n^{\delta}]$ one has
$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} n^{\delta} 1_{A_n}(X_k) = 1$.
Probability theory also matters in complexity theory as a section on
arithmetic random variables shows. It turns out that random variables like
$X_n(k) = k$, $Y_n(k) = k^2 + 3 \bmod n$ defined on finite probability
spaces become independent in the limit $n \to \infty$. Such considerations
matter in complexity theory: arithmetic functions defined on large but
finite sets behave very much like random functions. This is reflected by
the fact that the inverse of arithmetic functions is in general difficult
to compute and belongs to the complexity class NP. Indeed, if one could
invert arithmetic functions easily, one could solve problems like factoring
integers fast. A short section on Diophantine equations indicates how the
distribution of random variables can shed light on the solution of
Diophantine equations. Finally, we look at a topic in harmonic analysis
which was initiated by Norbert Wiener. It deals with the relation of the
characteristic function $\phi_X$ and the continuity properties of the
random variable $X$.



1.2 Some paradoxes in probability theory 

Colloquial language is not always precise enough to tackle problems in
probability theory. Paradoxes appear when definitions allow different
interpretations. Ambiguous language can lead to wrong conclusions or
contradicting solutions. To illustrate this, we mention a few problems. For
many more, see [110]. The following four examples should serve as a
motivation to introduce probability theory on a rigorous mathematical
footing.

1) Bertrand's paradox (Bertrand 1889) 

We throw random lines onto the unit disc. What is the probability that
the line intersects the disc with a length $\ge \sqrt{3}$, the side length
of the inscribed equilateral triangle?

First answer: take an arbitrary point $P$ on the boundary of the disc. The
set of all lines through that point is parameterized by an angle $\phi$. In
order that the chord is longer than $\sqrt{3}$, the line has to lie within a
sector of 60° within a range of 180°. The probability is 1/3.

Second answer: take all lines perpendicular to a fixed diameter. The chord
is longer than $\sqrt{3}$ if the point of intersection lies on the middle
half of the diameter. The probability is 1/2.

Third answer: if the midpoints of the chords lie in a disc of radius 1/2, the
chord is longer than $\sqrt{3}$. Because the disc has a radius which is half
the radius of the unit disc, the probability is 1/4.




Like most paradoxes in mathematics, a part of the question in Bertrand's
problem is not well defined. Here it is the term "random line". The
solution of the paradox lies in the fact that the three answers depend on
the chosen probability distribution. There are several "natural"
distributions. The actual answer depends on how the experiment is performed.
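The dependence on the experiment can be made concrete by simulating the three ways of producing a random chord. This is a sketch under the assumption that each "answer" corresponds to the sampling rule stated in the comments; the function names, the sample size and the seed are illustrative choices.

```python
import math
import random

SQRT3 = math.sqrt(3)

def chord_by_angle(rng):
    """First answer: line through a fixed boundary point, angle uniform in (0, pi)."""
    phi = rng.uniform(0.0, math.pi)    # angle between the line and the tangent
    return 2.0 * math.sin(phi)

def chord_by_distance(rng):
    """Second answer: chord perpendicular to a diameter, distance from center uniform in [0, 1]."""
    d = rng.uniform(0.0, 1.0)
    return 2.0 * math.sqrt(1.0 - d * d)

def chord_by_midpoint(rng):
    """Third answer: midpoint uniform in the disc, so the squared radius is uniform in [0, 1]."""
    r_squared = rng.random()
    return 2.0 * math.sqrt(1.0 - r_squared)

def estimate(sampler, n=100_000, seed=0):
    """Fraction of sampled chords which are longer than sqrt(3)."""
    rng = random.Random(seed)
    return sum(sampler(rng) > SQRT3 for _ in range(n)) / n

answers = [estimate(chord_by_angle), estimate(chord_by_distance), estimate(chord_by_midpoint)]
print(answers)   # expected to be near 1/3, 1/2 and 1/4
```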

2) Petersburg paradox (D.Bernoulli, 1738) 

In the Petersburg casino, you pay an entrance fee $c$ and you get the prize
$2^T$, where $T$ is the number of times the casino flips a coin until "head"
appears. For example, if the sequence of coin experiments would give "tail,
tail, head", you would win $2^3 - c = 8 - c$, the win minus the entrance
fee. Fair would be an entrance fee which is equal to the expectation of the
win, which is

$$ \sum_{k=1}^{\infty} 2^k P[T = k] = \sum_{k=1}^{\infty} 1 = \infty . $$



The paradox is that nobody would agree to pay even an entrance fee $c = 10$.
The problem with this casino is that it is not quite clear what is "fair".
For example, the situation $T = 20$ is so improbable that it never occurs
in the life-time of a person. Therefore, for any practical purpose, one does
not have to worry about large values of $T$. This, as well as the finiteness
of money resources, is the reason why casinos do not have to worry about the
following bullet proof martingale strategy in roulette: bet $c$ dollars on
red. If you win, stop. If you lose, bet $2c$ dollars on red. If you win,
stop. If you lose, bet $4c$ dollars on red. Keep doubling the bet.
Eventually after $n$ steps, red will occur and you will win
$2^n c - (c + 2c + \cdots + 2^{n-1} c) = c$ dollars. This example motivates
the concept of martingales. Theorem (3.2.7) or proposition (3.2.9) will shed
some light on this. Back to the Petersburg paradox. How does one resolve it?
What would be a reasonable entrance fee in "real life"? Bernoulli proposed
to replace the expectation $E[G]$ of the profit $G = 2^T$ with the
expectation $(E[\sqrt{G}])^2$, where $u(x) = \sqrt{x}$ is called a utility
function. This would lead to a fair entrance fee

$$ (E[\sqrt{G}])^2 = \Big( \sum_{k=1}^{\infty} 2^{k/2} 2^{-k} \Big)^2 = \frac{1}{(\sqrt{2} - 1)^2} = 5.828\ldots . $$

It is not so clear if that is a way out of the paradox because for any proposed 
utility function u(k), one can modify the casino rule so that the paradox 
reappears: pay (2 k ) 2 if the utility function u(k) = \fk or pay e 2 ^ dollars, 
if the utility function is u(k) = log(fc). Such reasoning plays a role in 
economics and social sciences. 



Figure. The picture to the right shows the average profit development
during a typical tournament of 4000 Petersburg games. After these 4000
games, the player would have lost about 10 thousand dollars, when paying a
10 dollar entrance fee each game. The player would have to play a very, very
long time to catch up. Mathematically, the player will do so and have a
profit in the long run, but it is unlikely that it will happen in his or her
lifetime.
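The tournament described above can be imitated with a few lines of code. This is only a sketch under stated assumptions: the prize is $2^T$ with $P[T = k] = 2^{-k}$ as in the text, while the function names, the seed and the truncation of the series after 200 terms are illustrative choices.

```python
import math
import random

def utility_fair_fee(terms=200):
    # Bernoulli's fee (E[sqrt(G)])^2 for G = 2^T with P[T = k] = 2^(-k).
    # The series is truncated; the neglected tail is far below machine precision.
    s = sum(2.0 ** (k / 2) * 2.0 ** (-k) for k in range(1, terms + 1))
    return s * s

def play_once(rng):
    # Flip a fair coin until "head" appears; T counts all flips.
    t = 1
    while rng.random() < 0.5:  # "tail" with probability 1/2
        t += 1
    return 2 ** t

def average_profit(games, fee, seed=0):
    # Average of (prize - fee) over a tournament of the given length.
    rng = random.Random(seed)
    return sum(play_once(rng) - fee for _ in range(games)) / games

if __name__ == "__main__":
    print("utility based fee:", utility_fair_fee())
    print("average profit per game:", average_profit(4000, 10))
```

Because the prize has infinite expectation, the printed average depends strongly on the seed: rare long runs of tails dominate the outcome.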






3) The three door problem (1991)

Suppose you're on a game show and you are given a choice of three doors.
Behind one door is a car and behind the others are goats. You pick a door,
say No. 1, and the host, who knows what's behind the doors, opens another
door, say No. 3, which has a goat. (In all games, he opens a door to reveal
a goat). He then says to you, "Do you want to pick door No. 2?" (In all
games he always offers an option to switch). Is it to your advantage to
switch your choice?

The problem is also called "Monty Hall problem" and was discussed by
Marilyn vos Savant in a "Parade" column in 1991 and provoked a big
controversy. (See [102] for pointers and similar examples and [90] for much
more background.) The problem is that intuitive argumentation can easily
lead to the conclusion that it does not matter whether to change the door
or not. In fact, switching the door doubles the chances to win:

No switching: you choose a door and win with probability 1/3. The door
opened by the host no longer affects your choice.

Switching: if you first choose the door with the car, you lose since you
switch. If you first choose a door with a goat, the host opens the other
door with a goat and you win by switching. There are two such cases where
you win. The probability to win is 2/3.
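The case analysis can be confirmed by enumerating all equally likely combinations of car position and first pick. The sketch below assumes the rules stated in the problem (the host always opens a goat door and always offers the switch); the function name is a hypothetical choice.

```python
from fractions import Fraction
from itertools import product

DOORS = (1, 2, 3)

def win_probability(switch):
    # Each of the 9 pairs (car position, first pick) has probability 1/9.
    wins = Fraction(0)
    for car, pick in product(DOORS, DOORS):
        # The host opens a goat door different from the pick. If he has two
        # options (pick == car), the choice does not change the outcome.
        opened = next(d for d in DOORS if d != pick and d != car)
        final = pick
        if switch:
            final = next(d for d in DOORS if d != pick and d != opened)
        if final == car:
            wins += Fraction(1, 9)
    return wins

if __name__ == "__main__":
    print("no switching:", win_probability(False))
    print("switching:   ", win_probability(True))
```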

4) The Banach-Tarski paradox (1924) 

It is possible to cut the standard unit ball
$\Omega = \{x \in \mathbb{R}^3 \mid |x| \leq 1\}$ into 5 disjoint pieces
$\Omega = Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \cup Y_5$ and rotate and translate
the pieces with transformations $T_i$ so that $T_1(Y_1) \cup T_2(Y_2) = \Omega$
and $T_3(Y_3) \cup T_4(Y_4) \cup T_5(Y_5) = \Omega'$ is a second unit ball
$\Omega' = \{x \in \mathbb{R}^3 \mid |x - (3, 0, 0)| \leq 1\}$ and all
the transformed sets again don't intersect.

While this example of Banach-Tarski is spectacular, the existence of bounded
subsets A of the circle to which one can not assign a translational invariant
probability P[A] can already be achieved in one dimension. The Italian
mathematician Giuseppe Vitali gave in 1905 the following example: define
an equivalence relation on the circle $\mathbb{T} = [0, 2\pi)$ by saying that
two angles are equivalent, $x \sim y$, if $(x - y)/\pi$ is a rational angle.
Let A be a subset of the circle which contains exactly one number from each
equivalence class. The axiom of choice assures the existence of A. If
$x_1, x_2, \dots$ is an enumeration of the set of rational angles in the
circle, then the sets $A_i = A + x_i$ are pairwise disjoint and satisfy
$\bigcup_{i=1}^{\infty} A_i = \mathbb{T}$. If we could assign a translational
invariant probability $P[A_i]$ to A, then the basic rules of probability would
give

$$ 1 = P[\mathbb{T}] = P\Big[\bigcup_{i=1}^{\infty} A_i\Big] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} p . $$

But there is no real number $p = P[A] = P[A_i]$ which makes this possible.
Both the Banach-Tarski and Vitali's results show that one can not
hope to define a probability space on the algebra $\mathcal{A}$ of all subsets
of the unit ball or the unit circle such that the probability measure is
translational and rotational invariant. The natural concepts of "length" or
"volume", which are rotational and translational invariant, only make sense
for a smaller algebra. This will lead to the notion of $\sigma$-algebra. In the
context of topological spaces like Euclidean spaces, it leads to Borel
$\sigma$-algebras, algebras of sets generated by the compact sets of the
topological space. This language will be developed in the next chapter.
1.3 Some applications of probability theory



Probability theory is a central topic in mathematics. There are close
relations and intersections with other fields like computer science, ergodic
theory and dynamical systems, cryptology, game theory, analysis, partial
differential equations, mathematical physics, economical sciences, statistical
mechanics and even number theory. As a motivation, we give some problems
and topics which can be treated with probabilistic methods.

1) Random walks: (statistical mechanics, gambling, stock markets, quan- 
tum field theory). 

Assume you walk through a lattice. At each vertex, you choose a direction
at random. What is the probability that you return to your starting
point? Polya's theorem (3.8.1) says that in two dimensions, a random
walker almost certainly returns to the origin arbitrarily often, while in three
dimensions, the walker with probability 1 only returns a finite number of
times and then escapes forever.
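A finite simulation can not prove recurrence or transience, but it shows the trend. The following sketch assumes a simple random walk on the lattice $\mathbb{Z}^d$ where each of the 2d neighbors is chosen with equal probability; function names, the number of walks and the number of steps are illustrative.

```python
import random

def returns_to_origin(dim, steps, rng):
    # Walk on Z^dim; report whether the origin is revisited within `steps` steps.
    pos = [0] * dim
    for _ in range(steps):
        axis = rng.randrange(dim)          # choose a coordinate direction
        pos[axis] += rng.choice((-1, 1))   # step forward or backward
        if not any(pos):
            return True
    return False

def return_frequency(dim, walks=2000, steps=200, seed=0):
    # Fraction of walks which come back to the starting point at least once.
    rng = random.Random(seed)
    return sum(returns_to_origin(dim, steps, rng) for _ in range(walks)) / walks

if __name__ == "__main__":
    for dim in (1, 2, 3):
        print(dim, return_frequency(dim))
```

The observed return frequency drops noticeably from one to two to three dimensions. In three dimensions it stays bounded away from 1 however long the walk is, in agreement with the theorem.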





Figure. A random walk in one dimension displayed as a graph $(t, B_t)$.

Figure. A piece of a random walk in two dimensions.

Figure. A piece of a random walk in three dimensions.



2) Percolation problems (model of a porous medium, statistical mechanics, 
critical phenomena). 

Each bond of a rectangular lattice in the plane is connected with probability
p and disconnected with probability $1 - p$. Two lattice points x, y in the
lattice are in the same cluster, if there is a path from x to y. One says that
"percolation occurs" if there is a positive probability that an infinite cluster
appears. One problem is to find the critical probability $p_c$, the infimum of
all p for which percolation occurs. The problem can be extended to situations
where the switch probabilities are not independent of each other. Some
random variables like the size of the largest cluster are of interest near the
critical probability $p_c$.
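The size of the largest cluster can be explored on a finite piece of the lattice. The sketch below is an illustration only: it replaces the infinite lattice by an n x n box, opens every bond independently with probability p and uses a union-find structure to merge clusters. All names are hypothetical.

```python
import random

def largest_cluster(n, p, seed=0):
    # Bond percolation on an n x n box; returns the size of the largest cluster.
    rng = random.Random(seed)
    parent = list(range(n * n))
    size = [1] * (n * n)

    def find(i):
        # Root of the cluster containing site i (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        # Merge the clusters of the sites i and j.
        ri, rj = find(i), find(j)
        if ri == rj:
            return
        if size[ri] < size[rj]:
            ri, rj = rj, ri
        parent[rj] = ri
        size[ri] += size[rj]

    for x in range(n):
        for y in range(n):
            site = x * n + y
            # Bond to the neighbor in the next row and in the next column.
            if x + 1 < n and rng.random() < p:
                union(site, site + n)
            if y + 1 < n and rng.random() < p:
                union(site, site + 1)
    return max(size[find(i)] for i in range(n * n))

if __name__ == "__main__":
    for p in (0.2, 0.4, 0.6):
        print(p, largest_cluster(40, p))
```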
Figure. Bond percolation with p=0.2.

Figure. Bond percolation with p=0.4.

Figure. Bond percolation with p=0.6.



A variant of bond percolation is site percolation where the nodes of the 
lattice are switched on with probability p. 



Figure. Site percolation with p=0.2.

Figure. Site percolation with p=0.4.

Figure. Site percolation with p=0.6.



Generalized percolation problems are obtained when the independence
of the individual nodes is relaxed. A class of such dependent percolation
problems can be obtained by choosing two irrational numbers $\alpha, \beta$
like $\alpha = \sqrt{2} - 1$ and $\beta = \sqrt{3} - 1$ and switching the node
(n, m) on if $(n\alpha + m\beta) \bmod 1 \in [0, p)$. The probability of
switching a node on is again p, but the random variables

$$ X_{n,m} = 1_{(n\alpha + m\beta) \bmod 1 \in [0,p)} $$

are no longer independent.
Figure. Dependent site percolation with p=0.2.

Figure. Dependent site percolation with p=0.4.

Figure. Dependent site percolation with p=0.6.



Even more general percolation problems are obtained if also the distribution
of the random variables $X_{n,m}$ can depend on the position (n, m).



3) Random Schrödinger operators. (quantum mechanics, functional analysis,
disordered systems, solid state physics)



Consider the linear map $Lu(n) = \sum_{|m-n|=1} u(m) + V(n)u(n)$ on the space
of sequences $u = (\dots, u_{-2}, u_{-1}, u_0, u_1, u_2, \dots)$. We assume that
V(n) takes random values in {0, 1}. The function V is called the potential.
The problem is to determine the spectrum or spectral type of the infinite
matrix L on the Hilbert space $l^2$ of all sequences u with finite
$\|u\|_2^2 = \sum_{n=-\infty}^{\infty} u_n^2$.
The operator L is the Hamiltonian of an electron in a one-dimensional
disordered crystal. The spectral properties of L are related to the
conductivity properties of the crystal. Of special interest is the situation
where the values V(n) are all independent random variables. It turns out
that if V(n) are IID random variables with a continuous distribution, there
are many eigenvalues for the infinite dimensional matrix L, at least with
probability 1. This phenomenon is called localization.
Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at
t = 0. Shown are both the potential $V_n$ and the wave $\psi(0)$.

Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at
t = 1. Shown are both the potential $V_n$ and the wave $\psi(1)$.

Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at
t = 2. Shown are both the potential $V_n$ and the wave $\psi(2)$.



More general operators are obtained by allowing V(n) to be random variables
with the same distribution but where one does not insist on independence
any more. A well studied example is the almost Mathieu operator,
where $V(n) = \lambda \cos(\theta + n\alpha)$ and for which $\alpha/(2\pi)$ is
irrational.



4) Classical dynamical systems (celestial mechanics, fluid dynamics, me- 
chanics, population models) 



The study of deterministic dynamical systems like the logistic map
$x \mapsto 4x(1 - x)$ on the interval [0, 1] or the three body problem in
celestial mechanics has shown that such systems or subsets of them can behave
like random systems. Many effects can be described by ergodic theory, which
can be seen as a brother of probability theory. Many results in probability
theory generalize to the more general setup of ergodic theory. An example is
Birkhoff's ergodic theorem which generalizes the law of large numbers.
Figure. Iterating the logistic map $T(x) = 4x(1 - x)$ on [0, 1] produces
independent random variables. The invariant measure P is continuous.

Figure. The simple mechanical system of a double pendulum exhibits
complicated dynamics. The differential equation defines a measure
preserving flow $T_t$ on a probability space.

Figure. A short time evolution of the Newtonian three body problem. There
are energies and subsets of the energy surface which are invariant and on
which there is an invariant probability measure.



Given a dynamical system defined by a map T or a flow $T_t$ on a subset
$\Omega$ of some Euclidean space, one obtains for every invariant probability
measure P a probability space $(\Omega, \mathcal{A}, P)$. An observed quantity
like a coordinate of an individual particle is a random variable X and defines
a stochastic process $X_n(\omega) = X(T^n \omega)$. For many dynamical systems,
including also some 3 body problems, there are invariant measures and
observables X for which $X_n$ are IID random variables. Probability theory is
therefore intrinsically relevant also in classical dynamical systems.

5) Cryptology. (computer science, coding theory, data encryption) 

Coding theory deals with the mathematics of encrypting codes and with the
design of error correcting codes. Both aspects of coding theory
have important applications. A good code can repair loss of information
due to bad channels and hide the information in an encrypted way. While
many aspects of coding theory are based in discrete mathematics, number
theory, algebra and algebraic geometry, there are probabilistic and
combinatorial aspects to the problem. We illustrate this with the example of a
public key encryption algorithm whose security is based on the fact that
it is hard to factor a large integer $N = pq$ into its prime factors p, q but
easy to verify that p, q are factors, if one knows them. The number N can
be public but only the person who knows the factors p, q can read the
message. Assume we want to crack the code and find the factors p and q.

The simplest method is to try to find the factors by trial and error but this
is impractical already if N has 50 digits. We would have to search through
$10^{25}$ numbers to find the factor p. This corresponds to probing 100 million
times every second over a time span of 15 billion years. There are better
methods known and we want to illustrate one of them now: assume we want to
find the factors of N = 11111111111111111111111111111111111111111111111.
The method goes as follows: start with an integer a and iterate the quadratic
map $T(x) = x^2 + c \bmod N$ on $\{0, 1, \dots, N - 1\}$. If we assume the
numbers $x_0 = a, x_1 = T(a), x_2 = T(T(a)), \dots$ to be random, how many such
numbers do we have to generate until two of them are the same modulo one of
the prime factors p? The answer is surprisingly small and based on the birthday
paradox: the probability that in a group of 23 students, two of them have the
same birthday is larger than 1/2: the probability of the event that we have
no birthday match is $1 \cdot (364/365)(363/365) \cdots (343/365) = 0.492703\dots$,
so that the probability of a birthday match is 1 - 0.492703 = 0.507297. This
is larger than 1/2. If we apply this thinking to the sequence of numbers
$x_i$ generated by the pseudo random number generator T, then we expect
to have a chance of 1/2 for finding a match modulo p in $\sqrt{p}$ iterations.
Because $p \leq \sqrt{N}$, we have to try about $N^{1/4}$ numbers to get a
factor: if $x_n$ and $x_m$ are the same modulo p, then $\gcd(x_n - x_m, N)$
produces the factor p of N. In the above example of the 47 digit number N,
there is a prime factor p = 35121409. The Pollard algorithm finds this factor
with probability 1/2 in $\sqrt{p} \approx 5926$ steps. This is an estimate only
which gives the order of magnitude. With the above N, if we start with a = 17
and take c = 3, then we have a match $x_{27720} = x_{13860}$ modulo p. It can
be found very fast.
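Both the birthday computation and the factoring method can be tried out directly. The following is a minimal sketch and not necessarily the implementation used for the numbers quoted above: it assumes that N is the number with 47 digits 1, uses the map $T(x) = x^2 + c \bmod N$ with a = 17 and c = 3, and detects a match by comparing $x_i$ with $x_{2i}$ (Floyd's cycle detection, a choice made here for illustration). The function names are hypothetical.

```python
from math import gcd

def birthday_match_probability(people, days=365):
    # 1 - (365/365)(364/365)...((365 - people + 1)/365)
    no_match = 1.0
    for i in range(people):
        no_match *= (days - i) / days
    return 1.0 - no_match

def pollard_rho(N, a=17, c=3, max_steps=1_000_000):
    # Iterate T(x) = x^2 + c mod N and compare x_i with x_(2i).
    # Returns (factor, i) or None if no nontrivial factor was found.
    def T(x):
        return (x * x + c) % N

    slow = fast = a
    for i in range(1, max_steps + 1):
        slow = T(slow)        # x_i
        fast = T(T(fast))     # x_(2i)
        g = gcd(abs(slow - fast), N)
        if 1 < g < N:
            return g, i
        if g == N:            # match modulo N itself: this start fails
            return None
    return None

if __name__ == "__main__":
    N = (10 ** 47 - 1) // 9   # assumption: the number N of the text, 47 digits 1
    print(birthday_match_probability(23))
    print(pollard_rho(N))
```

The number of iterations is of the order $\sqrt{p}$, far below the roughly $10^{23}$ candidates that trial division up to $\sqrt{N}$ would have to consider.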

This probabilistic argument would give a rigorous probabilistic estimate
if we picked truly random numbers. The algorithm of course generates
such numbers in a deterministic way and they are not truly random.
The generator is called a pseudo random number generator. It produces 
numbers which are random in the sense that many statistical tests can 
not distinguish them from true random numbers. Actually, many random 
number generators built into computer operating systems and program- 
ming languages are pseudo random number generators. 

Probabilistic thinking is often involved in designing, investigating and at- 
tacking data encryption codes or random number generators. 

6) Numerical methods. (integration, Monte Carlo experiments, algorithms)

In applied situations, it is often very difficult to find integrals directly.
This happens for example in statistical mechanics or quantum electrodynamics,
where one wants to find integrals in spaces with a large number of
dimensions. One can nevertheless compute numerical values using Monte Carlo
methods with a manageable amount of effort. Limit theorems assure that
these numerical values are reasonable. Let us illustrate this with a very
simple but famous example, the Buffon needle problem.

A stick of length 2 is thrown onto the plane filled with parallel lines, all
of which are distance d = 2 apart. If the center of the stick falls at
distance $y \in [0, 1]$ from the nearest line, then the interval of angles
leading to an intersection with a grid line has length $2\arccos(y)$ among a
possible range of angles $[0, \pi]$. The probability of hitting a line is
therefore $\int_0^1 2\arccos(y)/\pi \, dy = 2/\pi$.
This leads to a Monte Carlo method to compute $\pi$. Just throw randomly
n sticks onto the plane and count the number k of times a stick hits a line.
The number 2n/k is an approximation of $\pi$. This is of course not an
effective way to compute $\pi$ but it illustrates the principle.
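The experiment is easy to imitate. The sketch below assumes the setup of the text (stick of length 2, lines at distance 2). To avoid using $\pi$ in the simulation itself, the direction of the stick is drawn by choosing a uniform point in the unit disc; this detail and the function name are illustrative choices.

```python
import math
import random

def buffon_estimate(n, seed=0):
    # Throw n sticks of length 2 onto lines of distance 2 and return 2n/k,
    # where k is the number of sticks that hit a line.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y = rng.random()  # distance from the stick center to the nearest line
        # Uniform random direction: take a uniform point (u, v) in the unit disc.
        while True:
            u, v = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
            r2 = u * u + v * v
            if 0.0 < r2 <= 1.0:
                break
        # |u| / sqrt(r2) is |cos| of the angle between the stick and the
        # normal of the lines; the half stick has length 1.
        if y <= abs(u) / math.sqrt(r2):
            hits += 1
    return 2 * n / hits

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, buffon_estimate(n))
```

The error decreases only like $1/\sqrt{n}$, which is why the method is a poor way to compute $\pi$ but a good illustration of the principle.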



Figure. The Buffon needle problem is a Monte Carlo method to compute $\pi$.
By counting the number of hits in a sequence of experiments, one can get
random approximations of $\pi$. The law of large numbers assures that the
approximations will converge to the expected limit. All Monte Carlo
computations are theoretically based on limit theorems.




Chapter 2 

Limit theorems 



2.1 Probability spaces, random variables, independence

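Let $\Omega$ be an arbitrary set.

Definition. A set $\mathcal{A}$ of subsets of $\Omega$ is called a
$\sigma$-algebra if the following three properties are satisfied:

(i) $\Omega \in \mathcal{A}$,
(ii) $A \in \mathcal{A} \Rightarrow A^c = \Omega \setminus A \in \mathcal{A}$,
(iii) $A_n \in \mathcal{A} \Rightarrow \bigcup_{n \in \mathbb{N}} A_n \in \mathcal{A}$.

A pair $(\Omega, \mathcal{A})$ for which $\mathcal{A}$ is a $\sigma$-algebra in
$\Omega$ is called a measurable space.

Properties. If $\mathcal{A}$ is a $\sigma$-algebra, and $A_n$ is a sequence in
$\mathcal{A}$, then the following properties follow immediately by checking the
axioms:

1) $\bigcap_{n \in \mathbb{N}} A_n \in \mathcal{A}$.
2) $\limsup_n A_n := \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m \in \mathcal{A}$.
3) $\liminf_n A_n := \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m \in \mathcal{A}$.
4) If $\mathcal{A}, \mathcal{B}$ are algebras, then $\mathcal{A} \cap \mathcal{B}$ is an algebra.
5) If $\{\mathcal{A}_i\}_{i \in I}$ is a family of $\sigma$-sub-algebras of
$\mathcal{A}$, then $\bigcap_{i \in I} \mathcal{A}_i$ is a $\sigma$-algebra.

Example. For an arbitrary set $\Omega$, $\mathcal{A} = \{\emptyset, \Omega\}$ is
a $\sigma$-algebra. It is called the trivial $\sigma$-algebra.

Example. If $\Omega$ is an arbitrary set, then
$\mathcal{A} = \{A \subset \Omega\}$ is a $\sigma$-algebra. The set of all
subsets of $\Omega$ is the largest $\sigma$-algebra one can define on a set.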
Example. A finite set of subsets $A_1, A_2, \dots, A_n$ of $\Omega$ which are
pairwise disjoint and whose union is $\Omega$ is called a partition of
$\Omega$. It generates the $\sigma$-algebra
$\mathcal{A} = \{A = \bigcup_{j \in J} A_j\}$ where J runs over all subsets of
$\{1, \dots, n\}$. This $\sigma$-algebra has $2^n$ elements. Every finite
$\sigma$-algebra is of this form. The smallest nonempty elements
$\{A_1, \dots, A_n\}$ of this algebra are called atoms.

Definition. For any set $\mathcal{C}$ of subsets of $\Omega$, we can define
$\sigma(\mathcal{C})$, the smallest $\sigma$-algebra $\mathcal{A}$ which
contains $\mathcal{C}$. The $\sigma$-algebra $\mathcal{A}$ is the intersection
of all $\sigma$-algebras which contain $\mathcal{C}$. It is again a
$\sigma$-algebra.

Example. For $\Omega = \{1, 2, 3\}$, the set
$\mathcal{C} = \{\{1, 2\}, \{2, 3\}\}$ generates the $\sigma$-algebra
$\mathcal{A}$ which consists of all 8 subsets of $\Omega$.
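For a finite set $\Omega$, the generated $\sigma$-algebra can be computed by closing the given collection under complements and unions until nothing new appears. The sketch below is meant only for small finite examples; the function name is hypothetical.

```python
from itertools import combinations

def generated_sigma_algebra(omega, collection):
    # Smallest family of subsets of the finite set omega which contains the
    # collection, the empty set and omega, and is closed under complements
    # and unions. On a finite set this is the generated sigma-algebra.
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for a in current:
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets

if __name__ == "__main__":
    algebra = generated_sigma_algebra({1, 2, 3}, [{1, 2}, {2, 3}])
    print(len(algebra), sorted(sorted(s) for s in algebra))
```

Applied to a partition with n atoms, the same function returns a family with $2^n$ elements, in agreement with the example on partitions.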

Definition. If $(E, \mathcal{O})$ is a topological space, where $\mathcal{O}$
is the set of open sets in E, then $\sigma(\mathcal{O})$ is called the Borel
$\sigma$-algebra of the topological space. If $\mathcal{A} \subset \mathcal{B}$,
then $\mathcal{A}$ is called a subalgebra of $\mathcal{B}$. A set B in
$\mathcal{B}$ is also called a Borel set.



Remark. One sometimes defines the Borel $\sigma$-algebra as the
$\sigma$-algebra generated by the set of compact sets $\mathcal{C}$ of a
topological space. Compact sets in a topological space are sets for which
every open cover has a finite subcover. In Euclidean spaces $\mathbb{R}^n$,
where compact sets coincide with the sets which are both bounded and closed,
the Borel $\sigma$-algebra generated by the compact sets is the same as the
one generated by open sets. The two definitions agree for a large class of
topological spaces like "locally compact separable metric spaces".

Remark. Often, the Borel $\sigma$-algebra is enlarged to the $\sigma$-algebra
of all Lebesgue measurable sets, which includes all sets B which are a subset
of a Borel set A of measure 0. The smallest $\sigma$-algebra
$\overline{\mathcal{B}}$ which contains all these sets is called the completion
of $\mathcal{B}$. The completion of the Borel $\sigma$-algebra is the
$\sigma$-algebra of all Lebesgue measurable sets. It is in general strictly
larger than the Borel $\sigma$-algebra. But it can also have pathological
features: for example, the composition of a Lebesgue measurable function with
a continuous function does not need to be Lebesgue measurable any more.
(See [114], Example 2.4).

Example. The $\sigma$-algebra generated by the open balls
$\mathcal{C} = \{A = B_r(x)\}$ of a metric space (X, d) need not agree with
the family of Borel subsets, which are generated by $\mathcal{O}$, the set of
open sets in (X, d).
Proof. Take the metric space $(\mathbb{R}, d)$ where $d(x, y) = 1_{\{x \neq y\}}$
is the discrete metric. Because any subset of $\mathbb{R}$ is open, the Borel
$\sigma$-algebra is the set of all subsets of $\mathbb{R}$. The open balls in
$\mathbb{R}$ are either single points or the whole space. The
$\sigma$-algebra generated by the open balls is the set of countable
subsets of $\mathbb{R}$ together with their complements.
Example. If $\Omega = [0, 1] \times [0, 1]$ is the unit square and
$\mathcal{C}$ is the set of all sets of the form $[0, 1] \times [a, b]$ with
$0 \leq a < b \leq 1$, then $\sigma(\mathcal{C})$ is the $\sigma$-algebra of
all sets of the form $[0, 1] \times A$, where A is in the Borel
$\sigma$-algebra of [0, 1].

Definition. Given a measurable space $(\Omega, \mathcal{A})$. A function
$P : \mathcal{A} \to \mathbb{R}$ is called a probability measure and
$(\Omega, \mathcal{A}, P)$ is called a probability space if the following three
properties called Kolmogorov axioms are satisfied:

(i) $P[A] \geq 0$ for all $A \in \mathcal{A}$,
(ii) $P[\Omega] = 1$,
(iii) $A_n \in \mathcal{A}$ disjoint $\Rightarrow P[\bigcup_n A_n] = \sum_n P[A_n]$.

The last property is called $\sigma$-additivity.



Properties. Here are some basic properties of the probability measure
which immediately follow from the definition:

1) $P[\emptyset] = 0$.
2) $A \subset B \Rightarrow P[A] \leq P[B]$.
3) $P[\bigcup_n A_n] \leq \sum_n P[A_n]$.
4) $P[A^c] = 1 - P[A]$.
5) $0 \leq P[A] \leq 1$.
6) If $A_1 \subset A_2 \subset \cdots$ with $A_n \in \mathcal{A}$, then
$P[\bigcup_{n=1}^{\infty} A_n] = \lim_{n \to \infty} P[A_n]$.



Remark. There are different ways to build the axioms for a probability
space. One could for example replace (i) and (ii) with properties 4), 5) in
the above list. Statement 6) is equivalent to $\sigma$-additivity if P is only
assumed to be additive.

Remark. The name "Kolmogorov axioms" honors a monograph of Kol- 
mogorov from 1933 [54] in which an axiomatization appeared. Other math- 
ematicians have formulated similar axiomatizations at the same time, like 
Hans Reichenbach in 1932. According to Doob, axioms (i)-(iii) were first 
proposed by G. Bohlmann in 1908 [22]. 

Definition. A map X from a measurable space $(\Omega, \mathcal{A})$ to another
measurable space $(\Delta, \mathcal{B})$ is called measurable, if
$X^{-1}(B) \in \mathcal{A}$ for all $B \in \mathcal{B}$. The set $X^{-1}(B)$
consists of all points $x \in \Omega$ for which $X(x) \in B$. This pull back
set $X^{-1}(B)$ is defined even if X is non-invertible. For example, for
$X(x) = x^2$ on $(\mathbb{R}, \mathcal{B})$ one has
$X^{-1}([1, 4]) = [1, 2] \cup [-2, -1]$.



Definition. A function $X : \Omega \to \mathbb{R}$ is called a random variable,
if it is a measurable map from $(\Omega, \mathcal{A})$ to
$(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra
of $\mathbb{R}$. Denote by $\mathcal{L}$ the set of all real random variables.
The set $\mathcal{L}$ is an algebra under addition and multiplication: one can
add and multiply random variables and gets new random variables. More
generally, one can consider random variables taking values in a second
measurable space $(E, \mathcal{B})$. If $E = \mathbb{R}^d$, then the random
variable X is called a random vector. For a random vector
$X = (X_1, \dots, X_d)$, each component $X_i$ is a random variable.

Example. Let $\Omega = \mathbb{R}^2$ with Borel $\sigma$-algebra $\mathcal{A}$
and let

$$ P[A] = \frac{1}{2\pi} \iint_A e^{-(x^2 + y^2)/2} \, dx \, dy . $$

Any continuous function X of two variables is a random variable on $\Omega$.
For example, $X(x, y) = xy(x + y)$ is a random variable. But also
$X(x, y) = 1/(x + y)$ is a random variable, even though it is not continuous.
The vector-valued function $X(x, y) = (x, y, x^3)$ is an example of a random
vector.

Definition. Every random variable X defines a $\sigma$-algebra

$$ X^{-1}(\mathcal{B}) = \{X^{-1}(B) \mid B \in \mathcal{B}\} . $$

We denote this algebra by $\sigma(X)$ and call it the $\sigma$-algebra
generated by X.

Example. A constant map X(x) = c defines the trivial algebra
$\mathcal{A} = \{\emptyset, \Omega\}$.

Example. The map X(x, y) = x from the square $\Omega = [0, 1] \times [0, 1]$
to the real line $\mathbb{R}$ defines the algebra
$\mathcal{B} = \{A \times [0, 1]\}$, where A is in the Borel
$\sigma$-algebra of the interval [0, 1].

Example. The map X from $\mathbb{Z}_6 = \{0, 1, 2, 3, 4, 5\}$ to
$\{0, 1\} \subset \mathbb{R}$ defined by $X(x) = x \bmod 2$ has the value
X(x) = 0 if x is even and X(x) = 1 if x is odd. The $\sigma$-algebra generated
by X is $\mathcal{A} = \{\emptyset, \{1, 3, 5\}, \{0, 2, 4\}, \Omega\}$.

Definition. Given a set $B \in \mathcal{A}$ with P[B] > 0, we define

$$ P[A|B] = \frac{P[A \cap B]}{P[B]} , $$

the conditional probability of A with respect to B. It is the probability of
the event A, under the condition that the event B happens.

Example. We throw two fair dice. Let A be the event that the first die shows
6 and let B be the event that the sum of the two dice is 11. Because
P[B] = 2/36 = 1/18 and $P[A \cap B] = 1/36$ (we need to throw a 6 and then
a 5), we have P[A|B] = (1/36)/(1/18) = 1/2. The interpretation is that since
we know that the event B happens, we have only two possibilities: (5, 6)
or (6, 5). On this space of possibilities, only the second is compatible with
the event A.
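The computation can be repeated by listing all 36 outcomes. A minimal sketch with exact fractions follows; the helper names `prob` and `conditional` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
OMEGA = list(product(range(1, 7), repeat=2))

def prob(event):
    # Probability of an event given as a predicate on outcomes.
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def conditional(a, b):
    # P[A|B] = P[A and B] / P[B]
    return prob(lambda w: a(w) and b(w)) / prob(b)

def first_is_six(w):
    return w[0] == 6

def sum_is_eleven(w):
    return w[0] + w[1] == 11

if __name__ == "__main__":
    print(prob(sum_is_eleven), conditional(first_is_six, sum_is_eleven))
```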
Exercise. In [28], Martin Gardner writes: "Ask someone to name two faces
of a die. Suppose he names 2 and 5. Let him throw a pair of dice as often 
as he wishes. Each time you bet at even odds that either the 2 or the 5 or 
both will show." Is this a good bet? 



Exercise. a) Verify that the Sicherman dice with faces (1, 3, 4, 5, 6, 8) and
(1, 2, 2, 3, 3, 4) have the property that the probability of getting the value
k is the same as with a pair of standard dice. For example, the probability
to get 5 with the Sicherman dice is 4/36 because the four cases
(1, 4), (3, 2), (3, 2), (4, 1) lead to a sum 5. Also for the standard dice, we
have four cases (1, 4), (2, 3), (3, 2), (4, 1).

b) Three dice A, B, C are called non-transitive, if the probability that A > B
is larger than 1/2, the probability that B > C is larger than 1/2 and the
probability that C > A is larger than 1/2. Verify the non-transitivity
property for A = (1, 4, 4, 4, 4, 4), B = (3, 3, 3, 3, 3, 6) and
C = (2, 2, 2, 5, 5, 5).
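Both parts of the exercise can be checked by brute force before attempting a proof. The sketch below simply enumerates all 36 pairs of faces; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(d1, d2):
    # How many of the face pairs lead to each possible sum.
    return Counter(a + b for a, b in product(d1, d2))

def prob_greater(d1, d2):
    # Probability that die d1 shows a strictly larger value than die d2.
    wins = sum(1 for a, b in product(d1, d2) if a > b)
    return Fraction(wins, len(d1) * len(d2))

if __name__ == "__main__":
    standard = (1, 2, 3, 4, 5, 6)
    same = sum_distribution((1, 3, 4, 5, 6, 8), (1, 2, 2, 3, 3, 4)) == \
        sum_distribution(standard, standard)
    print("Sicherman dice match standard dice:", same)

    A = (1, 4, 4, 4, 4, 4)
    B = (3, 3, 3, 3, 3, 6)
    C = (2, 2, 2, 5, 5, 5)
    print(prob_greater(A, B), prob_greater(B, C), prob_greater(C, A))
```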



Properties. The following properties of conditional probability are called
Keynes postulates. While they follow immediately from the definition
of conditional probability, they are historically interesting because they
appeared already in 1921 as part of an axiomatization of probability theory:

1) $P[A|B] \geq 0$.
2) $P[A|A] = 1$.
3) $P[A|B] + P[A^c|B] = 1$.
4) $P[A \cap B|C] = P[A|C] \cdot P[B|A \cap C]$.



Definition. A finite set $\{A_1, \dots, A_n\} \subset \mathcal{A}$ is called a
finite partition of $\Omega$ if $\bigcup_{j=1}^{n} A_j = \Omega$ and
$A_j \cap A_i = \emptyset$ for $i \neq j$. A finite partition covers the entire
space with finitely many, pairwise disjoint sets.

If all possible experiments are partitioned into different events $A_j$ and
the probabilities that B occurs under the condition $A_j$ are known, then one
can compute the probability that $A_i$ occurs knowing that B happens:



Theorem 2.1.1 (Bayes rule). Given a finite partition $\{A_1, \dots, A_n\}$ in
$\mathcal{A}$ and $B \in \mathcal{A}$ with P[B] > 0, one has

$$ P[A_i|B] = \frac{P[B|A_i] P[A_i]}{\sum_{j=1}^{n} P[B|A_j] P[A_j]} . $$

Proof. Because the denominator is $P[B] = \sum_{j=1}^{n} P[B|A_j]P[A_j]$, the
Bayes rule just says $P[A_i|B]P[B] = P[B|A_i]P[A_i]$. But these are by
definition both $P[A_i \cap B]$. □

Example. A fair die is rolled first. It gives a random number k from
{1, 2, 3, 4, 5, 6}. Next, a fair coin is tossed k times. Assume we know that
all coins show heads. What is the probability that the score of the die was
equal to 5?

Solution. Let B be the event that all coins are heads and let $A_j$ be the
event that the die showed the number j. The problem is to find $P[A_5|B]$.
We know $P[B|A_j] = 2^{-j}$. Because the events $A_j$, j = 1, ..., 6 form a
partition of $\Omega$, we have

$$ P[B] = \sum_{j=1}^{6} P[B \cap A_j] = \sum_{j=1}^{6} P[B|A_j] P[A_j] = \sum_{j=1}^{6} 2^{-j}/6 = (1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64)(1/6) = 21/128 . $$

By Bayes rule,

$$ P[A_5|B] = \frac{P[B|A_5] P[A_5]}{\sum_{j=1}^{6} P[B|A_j] P[A_j]} = \frac{(1/32)(1/6)}{21/128} = \frac{2}{63} . $$
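The same numbers can be reproduced with exact rational arithmetic. The sketch below assumes the prior $P[A_j] = 1/6$ and the likelihood $P[B|A_j] = 2^{-j}$ of the example; the function name `bayes_posterior` is hypothetical.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood):
    # prior[j] = P[A_j], likelihood[j] = P[B|A_j].
    # Returns the dictionary j -> P[A_j|B] together with P[B].
    evidence = sum(likelihood[j] * prior[j] for j in prior)
    posterior = {j: likelihood[j] * prior[j] / evidence for j in prior}
    return posterior, evidence

PRIOR = {j: Fraction(1, 6) for j in range(1, 7)}            # fair die
LIKELIHOOD = {j: Fraction(1, 2 ** j) for j in range(1, 7)}  # j heads in a row

if __name__ == "__main__":
    posterior, evidence = bayes_posterior(PRIOR, LIKELIHOOD)
    print("P[B] =", evidence)
    for j in sorted(posterior):
        print(j, posterior[j])
```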



Figure. The probabilities $P[A_j|B]$ in the last problem.




Example. The Girl-Boy problem has been popularized by Martin Gardner: 
"Dave has two children. One child is a boy. What is the probability that 
the other child is a girl" ? 



Most people would intuitively say 1/2 because the second event looks
independent of the first. However, it is not and the initial intuition is
misleading. Here is the solution: first introduce the probability space of all
possible events $\Omega = \{bg, gb, bb, gg\}$ with
P[{bg}] = P[{gb}] = P[{bb}] = P[{gg}] = 1/4. Let B = {bg, gb, bb} be the event
that there is at least one boy and A = {gb, bg, gg} be the event that there is
at least one girl. We have

$$ P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{1/2}{3/4} = \frac{2}{3} . $$
Example. A variant of the Boy-Girl problem is due to Gary Foshee [84].
We formulate it in a simplified form: "Dave has two children, one of whom
is a boy born at night. What is the probability that Dave has two boys?"
It is assumed of course that the probability to have a boy (b) or girl (g)
is 1/2 and that the probability to be born at night (n) or day (d) is 1/2
too. One would think that the additional information "to be born at night"
does not influence the probability and that the overall answer is still 1/3
like in the boy-girl problem. But this is not the case. The probability space
of all events has 16 elements $\Omega$ = {(bd)(bd), (bd)(bn), (bn)(bd), (bn)(bn),
(bd)(gd), (bd)(gn), (bn)(gd), (bn)(gn), (gd)(bd), (gd)(bn), (gn)(bd), (gn)(bn),
(gd)(gd), (gd)(gn), (gn)(gd), (gn)(gn)}. The information that one of the
kids is a boy eliminates the last 4 elements. The information that the boy
is born at night only allows pairings containing (bn) and eliminates all cases
with (bd) if there is not also a (bn) there. We are left with an event B
containing 7 cases which encodes the information that one of the kids is a boy
born at night:

B = {(bd)(bn), (bn)(bd), (bn)(bn), (bn)(gd), (bn)(gn), (gd)(bn), (gn)(bn)} .

The event A that Dave has two boys and one of them is born at night is
A = {(bd)(bn), (bn)(bd), (bn)(bn)}.
The answer is the conditional probability $P[A|B] = P[A \cap B]/P[B] = 3/7$.
This is bigger than 1/3, the probability without the knowledge of being
born at night.
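The counting argument can be automated. In the following sketch every child carries a sex and one of k equally likely labels (k = 2 stands for night or day, with label 0 meaning night); the function name and the encoding are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

def two_boys_given_marked_boy(k):
    # Each child is a pair (sex, label) with sex in {b, g} and k equally
    # likely labels. Condition on the event that at least one child is a
    # boy with label 0 and return the probability of having two boys.
    children = [(sex, label) for sex in "bg" for label in range(k)]
    families = list(product(children, repeat=2))  # all equally likely
    B = [f for f in families if ("b", 0) in f]
    A_and_B = [f for f in B if f[0][0] == "b" and f[1][0] == "b"]
    return Fraction(len(A_and_B), len(B))

if __name__ == "__main__":
    print(two_boys_given_marked_boy(1))  # no extra information
    print(two_boys_given_marked_boy(2))  # night or day
```

With k = 1 the label carries no information and one recovers the value 1/3 of the boy-girl problem; with k = 2 the function counts the 7 cases of B and the 3 cases of A listed above.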



Exercise. Solve the original Foshee problem: "Dave has two children, one 
of whom is a boy born on a Tuesday. What is the probability that Dave 
has two boys?" 



Exercise. This version is close to the original Gardner paradox: 

a) I throw two coins onto the floor. A friend who stands nearby looks at
them and tells me: "At least one of them is head". What is the probability
that the other is head?

b) I throw two coins onto the floor. One rolls under a bookshelf and is
invisible. My friend who stands near the visible coin tells me "At least one
of them is head". What is the probability that the hidden one is head?
Explain why in a) the probability is 1/3 and in b) the probability is 1/2.



Definition. Two events A, B in a probability space $(\Omega, \mathcal{A}, P)$
are called independent, if

$$ P[A \cap B] = P[A] \cdot P[B] . $$

Example. The probability space $\Omega = \{1, 2, 3, 4, 5, 6\}$ and
$p_i = P[\{i\}] = 1/6$ describes a fair die which is thrown once. The set
A = {1, 3, 5} is the event that "the die produces an odd number". It has the
probability 1/2. The event B = {1, 2} is the event that the die shows a number
smaller than 3. It has probability 1/3. The two events are independent because
$P[A \cap B] = P[\{1\}] = 1/6 = P[A] \cdot P[B]$.

Definition. Write $J \subset_f I$ if J is a finite subset of I. A family
$\{\mathcal{A}_i\}_{i \in I}$ of $\sigma$-sub-algebras of $\mathcal{A}$ is
called independent, if for every $J \subset_f I$ and every choice
$A_j \in \mathcal{A}_j$, $P[\bigcap_{j \in J} A_j] = \prod_{j \in J} P[A_j]$.
A family $\{X_j\}_{j \in J}$ of random variables is called independent, if
$\{\sigma(X_j)\}_{j \in J}$ are independent $\sigma$-algebras. A family
of sets $\{A_j\}_{j \in I}$ is called independent, if the $\sigma$-algebras
$\mathcal{A}_j = \{\emptyset, A_j, A_j^c, \Omega\}$ are independent.

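Example. On $\Omega = \{1, 2, 3, 4\}$ with the uniform probability measure, the two
$\sigma$-algebras $\mathcal{A} = \{\emptyset, \{1, 3\}, \{2, 4\}, \Omega\}$ and $\mathcal{B} = \{\emptyset, \{1, 2\}, \{3, 4\}, \Omega\}$ are independent.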



Properties. (1) If a $\sigma$-algebra $\mathcal{F} \subset \mathcal{A}$ is independent of itself, then
$P[A \cap A] = P[A] = P[A]^2$ so that for every $A \in \mathcal{F}$, $P[A] \in \{0, 1\}$. Such a $\sigma$-algebra
is called P-trivial.

(2) Two sets $A, B \in \mathcal{A}$ are independent if and only if $P[A \cap B] = P[A] \cdot P[B]$.

(3) If $A, B$ are independent, then $A, B^c$ are independent too.

(4) If $P[B] > 0$, and $A, B$ are independent, then $P[A|B] = P[A]$ because

$$P[A|B] = (P[A] \cdot P[B])/P[B] = P[A] .$$

(5) For independent sets $A, B$, the $\sigma$-algebras $\mathcal{A} = \{\emptyset, A, A^c, \Omega\}$ and
$\mathcal{B} = \{\emptyset, B, B^c, \Omega\}$ are independent.

Definition. A family $\mathcal{I}$ of subsets of $\Omega$ is called a $\pi$-system, if $\mathcal{I}$ is closed
under intersections: if $A, B$ are in $\mathcal{I}$, then $A \cap B$ is in $\mathcal{I}$. A $\sigma$-additive map
from a $\pi$-system $\mathcal{I}$ to $[0, \infty)$ is called a measure.

Example. 1) The family $\mathcal{I} = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{2, 3\}, \Omega\}$ is a $\pi$-system
on $\Omega = \{1, 2, 3\}$.

2) The set $\mathcal{I} = \{[a, b) \;|\; 0 \le a < b \le 1\} \cup \{\emptyset\}$ of all half closed intervals is a
$\pi$-system on $\Omega = [0, 1]$ because the intersection of two such intervals $[a, b)$
and $[c, d)$ is either empty or again such an interval, namely $[\max(a, c), \min(b, d))$.

Definition. We use the notation $A_n \nearrow A$ if $A_n \subset A_{n+1}$ and $\bigcup_n A_n = A$.

Let $\Omega$ be a set. $(\Omega, \mathcal{D})$ is called a Dynkin system if $\mathcal{D}$ is a set of subsets of
$\Omega$ satisfying

(i) $\Omega \in \mathcal{D}$,

(ii) $A, B \in \mathcal{D}, A \subset B \Rightarrow B \setminus A \in \mathcal{D}$,

(iii) $A_n \in \mathcal{D}, A_n \nearrow A \Rightarrow A \in \mathcal{D}$.
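
On a finite set both closure conditions can be tested mechanically. The sketch below checks the seven-set family on {1, 2, 3} from the example above; the function names are hypothetical, and the increasing-union condition is ignored because it holds automatically for a finite family.

```python
from itertools import combinations

omega = frozenset({1, 2, 3})
family = {frozenset(s) for s in [(), (1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)]}

def is_pi_system(fam):
    """True if fam is closed under pairwise intersections."""
    return all((a & b) in fam for a, b in combinations(fam, 2))

def closed_under_proper_differences(fam):
    """True if B minus A is in fam whenever A, B are in fam and A is a subset of B."""
    return all((b - a) in fam for a in fam for b in fam if a <= b)

print(is_pi_system(family), closed_under_proper_differences(family))
```

The family passes the first test but fails the second, since the complement of {2} is {1, 3}, which is missing. It is therefore a $\pi$-system without being a Dynkin system.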
Lemma 2.1.2. $(\Omega, \mathcal{A})$ is a $\sigma$-algebra if and only if it is a $\pi$-system and a
Dynkin system.



Proof. If $\mathcal{A}$ is a $\sigma$-algebra, then it certainly is both a $\pi$-system and a Dynkin
system. Assume now, $\mathcal{A}$ is both a $\pi$-system and a Dynkin system. Given
$A, B \in \mathcal{A}$. The Dynkin property implies that $A^c = \Omega \setminus A$, $B^c = \Omega \setminus B$
are in $\mathcal{A}$ and by the $\pi$-system property also $A \cup B = \Omega \setminus (A^c \cap B^c) \in \mathcal{A}$.
Given a sequence $A_n \in \mathcal{A}$. Define $B_n = \bigcup_{k=1}^n A_k \in \mathcal{A}$ and $A = \bigcup_n A_n$.
Then $B_n \nearrow A$ and by the Dynkin property $A \in \mathcal{A}$. We see that $\mathcal{A}$ is a
$\sigma$-algebra. □

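Definition. If $\mathcal{I}$ is any set of subsets of $\Omega$, we denote by $d(\mathcal{I})$ the smallest
Dynkin system which contains $\mathcal{I}$ and call it the Dynkin system generated
by $\mathcal{I}$.



Lemma 2.1.3. If $\mathcal{I}$ is a $\pi$-system, then $d(\mathcal{I}) = \sigma(\mathcal{I})$.



Proof. By the previous lemma, we need only to show that $d(\mathcal{I})$ is a
$\pi$-system.

(i) Define $\mathcal{D}_1 = \{B \in d(\mathcal{I}) \;|\; B \cap C \in d(\mathcal{I}), \forall C \in \mathcal{I}\}$. Because $\mathcal{I}$ is a
$\pi$-system, we have $\mathcal{I} \subset \mathcal{D}_1$.

Claim. $\mathcal{D}_1$ is a Dynkin system.

Proof. Clearly $\Omega \in \mathcal{D}_1$. Given $A, B \in \mathcal{D}_1$ with $A \subset B$. For $C \in \mathcal{I}$ we
compute $(B \setminus A) \cap C = (B \cap C) \setminus (A \cap C)$ which is in $d(\mathcal{I})$. Therefore
$B \setminus A \in \mathcal{D}_1$. Given $A_n \nearrow A$ with $A_n \in \mathcal{D}_1$ and $C \in \mathcal{I}$. Then $A_n \cap C \nearrow A \cap C$
so that $A \cap C \in d(\mathcal{I})$ and $A \in \mathcal{D}_1$. Since $\mathcal{D}_1 \subset d(\mathcal{I})$ is a Dynkin system
containing $\mathcal{I}$, the minimality of $d(\mathcal{I})$ gives $\mathcal{D}_1 = d(\mathcal{I})$.

(ii) Define $\mathcal{D}_2 = \{A \in d(\mathcal{I}) \;|\; B \cap A \in d(\mathcal{I}), \forall B \in d(\mathcal{I})\}$. From (i) we know
that $\mathcal{I} \subset \mathcal{D}_2$. Like in (i), we show that $\mathcal{D}_2$ is a Dynkin system. Therefore
$\mathcal{D}_2 = d(\mathcal{I})$, which means that $d(\mathcal{I})$ is a $\pi$-system. □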



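Lemma 2.1.4. (Extension lemma) Given a $\pi$-system $\mathcal{I}$. If two measures $\mu, \nu$
on $\sigma(\mathcal{I})$ satisfy $\mu(\Omega) = \nu(\Omega) < \infty$ and $\mu(A) = \nu(A)$ for $A \in \mathcal{I}$, then $\mu = \nu$.



Proof. The set $\mathcal{D} = \{A \in \sigma(\mathcal{I}) \;|\; \mu(A) = \nu(A)\}$ is a Dynkin system: first
of all $\Omega \in \mathcal{D}$. Given $A, B \in \mathcal{D}$, $A \subset B$. Then $\mu(B \setminus A) = \mu(B) - \mu(A) =
\nu(B) - \nu(A) = \nu(B \setminus A)$ so that $B \setminus A \in \mathcal{D}$. Given $A_n \in \mathcal{D}$ with $A_n \nearrow A$, then
the $\sigma$-additivity gives $\mu(A) = \limsup_n \mu(A_n) = \limsup_n \nu(A_n) = \nu(A)$, so
that $A \in \mathcal{D}$. Since $\mathcal{D}$ is a Dynkin system containing the $\pi$-system $\mathcal{I}$, we
know that $\sigma(\mathcal{I}) = d(\mathcal{I}) \subset \mathcal{D}$ which means that $\mu = \nu$ on $\sigma(\mathcal{I})$. □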

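Definition. Given a probability space $(\Omega, \mathcal{A}, P)$. Two $\pi$-systems $\mathcal{I}, \mathcal{J} \subset \mathcal{A}$
are called P-independent, if for all $A \in \mathcal{I}$ and $B \in \mathcal{J}$, $P[A \cap B] = P[A] \cdot P[B]$.



Lemma 2.1.5. Given a probability space $(\Omega, \mathcal{A}, P)$. Let $\mathcal{G}, \mathcal{H}$ be two
$\sigma$-subalgebras of $\mathcal{A}$ and $\mathcal{I}$ and $\mathcal{J}$ be two $\pi$-systems satisfying $\sigma(\mathcal{I}) = \mathcal{G}$,
$\sigma(\mathcal{J}) = \mathcal{H}$. Then $\mathcal{G}$ and $\mathcal{H}$ are independent if and only if $\mathcal{I}$ and $\mathcal{J}$ are
independent.



Proof. (i) Fix $I \in \mathcal{I}$ and define on $(\Omega, \mathcal{H})$ the measures $\mu(H) = P[I \cap H]$,
$\nu(H) = P[I]P[H]$ of total probability $P[I]$. By the independence of $\mathcal{I}$
and $\mathcal{J}$, they coincide on $\mathcal{J}$ and by the extension lemma (2.1.4), they agree
on $\mathcal{H}$ and we have $P[I \cap H] = P[I]P[H]$ for all $I \in \mathcal{I}$ and $H \in \mathcal{H}$.
(ii) Define for fixed $H \in \mathcal{H}$ the measures $\mu(G) = P[G \cap H]$ and $\nu(G) =
P[G]P[H]$ of total probability $P[H]$ on $(\Omega, \mathcal{G})$. They agree on $\mathcal{I}$ and so on $\mathcal{G}$.
We have shown that $P[G \cap H] = P[G]P[H]$ for all $G \in \mathcal{G}$ and all $H \in \mathcal{H}$. □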

Definition. $\mathcal{A}$ is an algebra if $\mathcal{A}$ is a set of subsets of $\Omega$ satisfying

(i) $\Omega \in \mathcal{A}$,

(ii) $A \in \mathcal{A} \Rightarrow A^c \in \mathcal{A}$,

(iii) $A, B \in \mathcal{A} \Rightarrow A \cap B \in \mathcal{A}$.

Remark. We see that $A^c \cap B = B \setminus A$ and $A \cap B^c = A \setminus B$ are also in the
algebra $\mathcal{A}$. The relation $A \cup B = (A^c \cap B^c)^c$ shows that the union $A \cup B$ is in the
algebra. Therefore also the symmetric difference $A \Delta B = (A \setminus B) \cup (B \setminus A)$
is in the algebra. The operation $\cap$ is the "multiplication" and the operation
$\Delta$ the "addition" in the algebra, explaining the name algebra. It is up to you
to find the zero element $0 \Delta A = A$ for all $A$ and the one element $1 \cap A = A$
in this algebra.

Definition. A $\sigma$-additive map from $\mathcal{A}$ to $[0, \infty)$ is called a measure.



Theorem 2.1.6 (Carathéodory continuation theorem). Any measure on an
algebra $\mathcal{R}$ has a unique continuation to a measure on $\sigma(\mathcal{R})$.



Before we launch into the proof of this theorem, we need two lemmas: 

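Definition. Let $\mathcal{A}$ be an algebra and $\lambda : \mathcal{A} \to [0, \infty]$ with $\lambda(\emptyset) = 0$. A set
$A \in \mathcal{A}$ is called a $\lambda$-set, if $\lambda(A \cap G) + \lambda(A^c \cap G) = \lambda(G)$ for all $G \in \mathcal{A}$.

Lemma 2.1.7. The set $\mathcal{A}_\lambda$ of $\lambda$-sets of an algebra $\mathcal{A}$ is again an algebra and
satisfies $\sum_{k=1}^n \lambda(A_k \cap G) = \lambda((\bigcup_{k=1}^n A_k) \cap G)$ for all finite disjoint families
$\{A_k\}_{k=1}^n$ and all $G \in \mathcal{A}$.



Proof. From the definition it is clear that $\Omega \in \mathcal{A}_\lambda$ and that if $B \in \mathcal{A}_\lambda$, then
$B^c \in \mathcal{A}_\lambda$. Given $B, C \in \mathcal{A}_\lambda$. Then $A = B \cap C \in \mathcal{A}_\lambda$. Proof. Since $C \in \mathcal{A}_\lambda$,
we get

$$\lambda(C \cap A^c \cap G) + \lambda(C^c \cap A^c \cap G) = \lambda(A^c \cap G) .$$

This can be rewritten with $C \cap A^c = C \cap B^c$ and $C^c \cap A^c = C^c$ as

$$\lambda(A^c \cap G) = \lambda(C \cap B^c \cap G) + \lambda(C^c \cap G) . \qquad (2.1)$$

Because $B$ is a $\lambda$-set, we get using $B \cap C = A$

$$\lambda(A \cap G) + \lambda(B^c \cap C \cap G) = \lambda(C \cap G) . \qquad (2.2)$$

Since $C$ is a $\lambda$-set, we have

$$\lambda(C \cap G) + \lambda(C^c \cap G) = \lambda(G) . \qquad (2.3)$$

Adding up these three equations shows that $B \cap C$ is a $\lambda$-set. We have so
verified that $\mathcal{A}_\lambda$ is an algebra. If $B$ and $C$ are disjoint in $\mathcal{A}_\lambda$ we deduce
from the fact that $B$ is a $\lambda$-set

$$\lambda(B \cap (B \cup C) \cap G) + \lambda(B^c \cap (B \cup C) \cap G) = \lambda((B \cup C) \cap G) .$$

This can be rewritten as $\lambda(B \cap G) + \lambda(C \cap G) = \lambda((B \cup C) \cap G)$. The analog
statement for finitely many sets is obtained by induction. □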

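Definition. Let $\mathcal{A}$ be a $\sigma$-algebra. A map $\lambda : \mathcal{A} \to [0, \infty]$ is called an outer
measure, if

$\lambda(\emptyset) = 0$,

$A, B \in \mathcal{A}$ with $A \subset B \Rightarrow \lambda(A) \le \lambda(B)$ (monotonicity),

$A_n \in \mathcal{A} \Rightarrow \lambda(\bigcup_n A_n) \le \sum_n \lambda(A_n)$ ($\sigma$-subadditivity).



Lemma 2.1.8. (Carathéodory's lemma) If $\lambda$ is an outer measure on a
measurable space $(\Omega, \mathcal{A})$, then $\mathcal{A}_\lambda \subset \mathcal{A}$ defines a $\sigma$-algebra on which $\lambda$ is
countably additive.

Proof. Given a disjoint sequence $A_n \in \mathcal{A}_\lambda$. We have to show that $A =
\bigcup_n A_n \in \mathcal{A}_\lambda$ and $\lambda(A) = \sum_n \lambda(A_n)$. By the above lemma (2.1.7), $B_n =
\bigcup_{k=1}^n A_k$ is in $\mathcal{A}_\lambda$. By the monotonicity and additivity, we have

$$\lambda(G) = \lambda(B_n \cap G) + \lambda(B_n^c \cap G) \ge \lambda(B_n \cap G) + \lambda(A^c \cap G) = \sum_{k=1}^n \lambda(A_k \cap G) + \lambda(A^c \cap G) .$$

Taking the limit $n \to \infty$ and using the $\sigma$-subadditivity gives

$$\lambda(G) \ge \sum_{k=1}^\infty \lambda(A_k \cap G) + \lambda(A^c \cap G) \ge \lambda(A \cap G) + \lambda(A^c \cap G) .$$

Subadditivity for $\lambda$ gives $\lambda(G) \le \lambda(A \cap G) + \lambda(A^c \cap G)$. All the inequalities
in this proof are therefore equalities. We conclude that $A \in \mathcal{A}_\lambda$. Finally we
show that $\lambda$ is $\sigma$-additive on $\mathcal{A}_\lambda$: for any $n \ge 1$ we have

$$\sum_{k=1}^n \lambda(A_k) \le \lambda(\bigcup_{k=1}^\infty A_k) \le \sum_{k=1}^\infty \lambda(A_k) .$$

Taking the limit $n \to \infty$ shows that the right hand side and left hand side
agree, verifying the $\sigma$-additivity. □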

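We now prove Carathéodory's continuation theorem (2.1.6):

Proof. Given an algebra $\mathcal{R}$ with a measure $\mu$. Define $\mathcal{A} = \sigma(\mathcal{R})$ and the
$\sigma$-algebra $\mathcal{P}$ consisting of all subsets of $\Omega$. Define on $\mathcal{P}$ the function

$$\lambda(A) = \inf\{ \sum_{n \in \mathbb{N}} \mu(A_n) \;|\; \{A_n\}_{n \in \mathbb{N}} \text{ sequence in } \mathcal{R} \text{ satisfying } A \subset \bigcup_n A_n \} .$$

(i) $\lambda$ is an outer measure on $\mathcal{P}$.

$\lambda(\emptyset) = 0$ and $\lambda(A) \le \lambda(B)$ for $A \subset B$ are obvious. To see the $\sigma$-subadditivity,
take a sequence $A_n \in \mathcal{P}$ with $\lambda(A_n) < \infty$ and fix $\epsilon > 0$. For all
$n \in \mathbb{N}$, one can (by the definition of $\lambda$) find a sequence $\{B_{n,k}\}_{k \in \mathbb{N}}$ in $\mathcal{R}$
such that $A_n \subset \bigcup_{k \in \mathbb{N}} B_{n,k}$ and $\sum_{k \in \mathbb{N}} \mu(B_{n,k}) \le \lambda(A_n) + \epsilon 2^{-n}$. Define $A =
\bigcup_{n \in \mathbb{N}} A_n \subset \bigcup_{n,k \in \mathbb{N}} B_{n,k}$, so that $\lambda(A) \le \sum_{n,k} \mu(B_{n,k}) \le \sum_n \lambda(A_n) + \epsilon$.
Since $\epsilon$ was arbitrary, the $\sigma$-subadditivity is proven.

(ii) $\lambda = \mu$ on $\mathcal{R}$.

Given $A \in \mathcal{R}$. Clearly $\lambda(A) \le \mu(A)$. Suppose that $A \subset \bigcup_n A_n$, with $A_n \in
\mathcal{R}$. Define a sequence $\{B_n\}_{n \in \mathbb{N}}$ of disjoint sets in $\mathcal{R}$ inductively. That is $B_1 =
A_1$, $B_n = A_n \cap (\bigcup_{k<n} A_k)^c$ such that $B_n \subset A_n$ and $\bigcup_n B_n = \bigcup_n A_n \supset A$.
From the $\sigma$-additivity of $\mu$ on $\mathcal{R}$ follows

$$\mu(A) \le \mu(\bigcup_n A_n) = \mu(\bigcup_n B_n) = \sum_n \mu(B_n) \le \sum_n \mu(A_n) .$$

Since the choice of $A_n$ is arbitrary, this gives $\mu(A) \le \lambda(A)$.

(iii) Let $\mathcal{P}_\lambda$ be the set of $\lambda$-sets in $\mathcal{P}$. Then $\mathcal{R} \subset \mathcal{P}_\lambda$.

Given $A \in \mathcal{R}$ and $G \in \mathcal{P}$. There exists a sequence $\{B_n\}_{n \in \mathbb{N}}$ in $\mathcal{R}$ such that
$G \subset \bigcup_n B_n$ and $\sum_n \mu(B_n) \le \lambda(G) + \epsilon$. By the definition of $\lambda$

$$\sum_n \mu(B_n) = \sum_n \mu(A \cap B_n) + \sum_n \mu(A^c \cap B_n) \ge \lambda(A \cap G) + \lambda(A^c \cap G)$$

because $A \cap G \subset \bigcup_n A \cap B_n$ and $A^c \cap G \subset \bigcup_n A^c \cap B_n$. Since $\epsilon$ is
arbitrary, we get $\lambda(G) \ge \lambda(A \cap G) + \lambda(A^c \cap G)$. On the other hand, since
$\lambda$ is sub-additive, we have also $\lambda(G) \le \lambda(A \cap G) + \lambda(A^c \cap G)$ and $A$ is a $\lambda$-set.

(iv) By (i) $\lambda$ is an outer measure on $(\Omega, \mathcal{P})$. Since by step (iii), $\mathcal{R} \subset \mathcal{P}_\lambda$,
we know by Carathéodory's lemma that $\mathcal{A} \subset \mathcal{P}_\lambda$, so that we can define $\mu$
on $\mathcal{A}$ as the restriction of $\lambda$ to $\mathcal{A}$. By step (ii), this is an extension of the
measure $\mu$ on $\mathcal{R}$.

(v) The uniqueness follows from Lemma (2.1.4). □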

Here is an overview over the possible sets of subsets of $\Omega$ we have considered.
We also include the notion of ring and $\sigma$-ring, which is often used in measure
theory and which differ from the notions of algebra or $\sigma$-algebra in that
$\Omega$ does not have to be in it. In probability theory, those notions are not
needed at first. For an introduction into measure theory, see [3, 38, 18].

Set of $\Omega$ subsets       contains               closed under
topology                  $\emptyset, \Omega$    arbitrary unions, finite intersections
$\pi$-system                                     finite intersections
Dynkin system             $\Omega$               increasing countable union, difference
ring                      $\emptyset$            differences and finite unions
$\sigma$-ring             $\emptyset$            countably many unions and differences
algebra                   $\Omega$               complement and finite unions
$\sigma$-algebra          $\emptyset, \Omega$    countably many unions and complement
Borel $\sigma$-algebra    $\emptyset, \Omega$    $\sigma$-algebra generated by the topology



Remark. The name "ring" has its origin in the fact that with the "addition"
$A + B = A \Delta B = (A \cup B) \setminus (A \cap B)$ and "multiplication" $A \cdot B = A \cap B$,
a ring of sets becomes an algebraic ring like the set of integers, in which
rules like $A \cdot (B + C) = A \cdot B + A \cdot C$ hold. The empty set is the zero
element because $A \Delta \emptyset = A$ for every set $A$. If the set $\Omega$ is also in the ring,
one has a ring with 1 because the identity $A \cap \Omega = A$ shows that $\Omega$ is the
1-element in the ring.

Let's add some definitions, which will occur later:

Definition. A nonzero measure $\mu$ on a measurable space $(\Omega, \mathcal{A})$ is called
positive, if $\mu(A) \ge 0$ for all $A \in \mathcal{A}$. If $\mu^+, \mu^-$ are two positive measures
so that $\mu(A) = \mu^+(A) - \mu^-(A)$ then this is called the Hahn decomposition of $\mu$.
A measure is called finite if it has a Hahn decomposition and the positive
measure $|\mu|$ defined by $|\mu|(A) = \mu^+(A) + \mu^-(A)$ satisfies $|\mu|(\Omega) < \infty$.

Definition. Let $\nu, \mu$ be two measures on the measurable space $(\Omega, \mathcal{A})$. We
write $\nu \ll \mu$ if for every $A$ in the $\sigma$-algebra $\mathcal{A}$, the condition $\mu(A) = 0$
implies $\nu(A) = 0$. One says that $\nu$ is absolutely continuous with respect to $\mu$.

Example. If $\mu = dx$ is the Lebesgue measure on $(\Omega, \mathcal{A}) = ([0, 1], \mathcal{A})$
satisfying $\mu([a, b]) = b - a$ for every interval and if $\nu([a, b]) = \int_a^b x^2 \, dx$ then
$\nu \ll \mu$.

Example. If $\mu = dx$ is the Lebesgue measure on $([0, 1], \mathcal{A})$ and $\nu = \delta_{1/2}$ is
the point measure which satisfies $\nu(A) = 1$ if $1/2 \in A$ and $\nu(A) = 0$ else.
Then $\nu$ is not absolutely continuous with respect to $\mu$. Indeed, for the set
$A = \{1/2\}$, we have $\mu(A) = 0$ but $\nu(A) = 1$.

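2.2 Kolmogorov's 0-1 law, Borel-Cantelli lemma

Definition. Given a family $\{\mathcal{A}_i\}_{i \in I}$ of $\sigma$-subalgebras of $\mathcal{A}$. For any nonempty
set $J \subset I$, let $\mathcal{A}_J := \bigvee_{j \in J} \mathcal{A}_j$ be the $\sigma$-algebra generated by $\bigcup_{j \in J} \mathcal{A}_j$.
Define also $\mathcal{A}_\emptyset = \{\emptyset, \Omega\}$. The tail $\sigma$-algebra $\mathcal{T}$ of $\{\mathcal{A}_i\}_{i \in I}$ is defined as
$\mathcal{T} = \bigcap_{J \subset_f I} \mathcal{A}_{J^c}$, where $J^c = I \setminus J$ and $J \subset_f I$ means that $J$ is a finite subset of $I$.



Theorem 2.2.1 (Kolmogorov's 0-1 law). If $\{\mathcal{A}_i\}_{i \in I}$ are independent
$\sigma$-algebras, then the tail $\sigma$-algebra $\mathcal{T}$ is P-trivial: $P[A] = 0$ or $P[A] = 1$ for
every $A \in \mathcal{T}$.



Proof. (i) The algebras $\mathcal{A}_F$ and $\mathcal{A}_G$ are independent, whenever $F, G \subset I$
are disjoint.

Proof. Define for $H \subset I$ the $\pi$-system

$$\mathcal{I}_H = \{A \in \mathcal{A} \;|\; A = \bigcap_{i \in K} A_i, \; K \subset_f H, \; A_i \in \mathcal{A}_i\} .$$

The $\pi$-systems $\mathcal{I}_F$ and $\mathcal{I}_G$ are independent and generate the $\sigma$-algebras $\mathcal{A}_F$
and $\mathcal{A}_G$. Use lemma (2.1.5).

(ii) Especially: $\mathcal{A}_J$ is independent of $\mathcal{A}_{J^c}$ for every $J \subset I$.

(iii) $\mathcal{T}$ is independent of $\mathcal{A}_I$.

Proof. $\mathcal{T} = \bigcap_{J \subset_f I} \mathcal{A}_{J^c}$ is independent of any $\mathcal{A}_K$ for $K \subset_f I$. It is
therefore independent of the $\pi$-system $\mathcal{I}_I$ which generates $\mathcal{A}_I$. Use again
lemma (2.1.5).

(iv) $\mathcal{T}$ is a sub-$\sigma$-algebra of $\mathcal{A}_I$. Therefore $\mathcal{T}$ is independent of itself which
implies that it is P-trivial. □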

Example. Let $X_n$ be a sequence of independent random variables and let

$$A = \{\omega \in \Omega \;|\; \sum_{n=1}^\infty X_n \text{ converges}\} .$$

Then $P[A] = 0$ or $P[A] = 1$. Proof. Because $\sum_{n=1}^\infty X_n$ converges if and only
if $Y_n = \sum_{k=n}^\infty X_k$ converges, we have $A \in \sigma(\mathcal{A}_n, \mathcal{A}_{n+1}, \dots)$. Therefore, $A$
is in $\mathcal{T}$, the tail $\sigma$-algebra defined by the independent $\sigma$-algebras $\mathcal{A}_n =
\sigma(X_n)$. If, for example, $X_n$ takes values $\pm 1$, each with probability 1/2,
then $P[A] = 0$ because the terms do not converge to 0. If $X_n$ takes values
$\pm 1/n^2$ each with probability 1/2, then $P[A] = 1$ because the series converges
absolutely. The decision whether $P[A] = 0$ or $P[A] = 1$ is related to the
convergence or divergence of a series and will be discussed later again.

Example. Let $\{A_n\}_{n \in \mathbb{N}}$ be a sequence of subsets of $\Omega$. The set

$$A_\infty := \limsup_{n \to \infty} A_n = \bigcap_{m=1}^\infty \bigcup_{n \ge m} A_n$$

consists of the points $\omega \in \Omega$ such that $\omega \in A_n$ for infinitely many $n \in \mathbb{N}$.
The set $A_\infty$ is contained in the tail $\sigma$-algebra of $\mathcal{A}_n = \{\emptyset, A_n, A_n^c, \Omega\}$. It
follows from Kolmogorov's 0-1 law that $P[A_\infty] \in \{0, 1\}$ if $A_n \in \mathcal{A}$ and
$\{A_n\}$ are P-independent.

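Remark. In the theory of dynamical systems, a measurable map $T : \Omega \to \Omega$
of a probability space $(\Omega, \mathcal{A}, P)$ onto itself is called a K-system, if there
exists a $\sigma$-subalgebra $\mathcal{F} \subset \mathcal{A}$ which satisfies $\mathcal{F} \subset \sigma(T(\mathcal{F}))$ for which the
sequence $\mathcal{F}_n = \sigma(T^n(\mathcal{F}))$ satisfies $\mathcal{F}_{\mathbb{N}} = \mathcal{A}$ and which has a trivial tail
$\sigma$-algebra $\mathcal{T} = \{\emptyset, \Omega\}$. An example of such a system is a shift map $T(x)_n =
x_{n+1}$ on $\Omega = \Delta^{\mathbb{N}}$, where $\Delta$ is a compact topological space. The K-system
property follows from Kolmogorov's 0-1 law: take $\mathcal{F} = \bigvee_{k=1}^\infty T^k(\mathcal{F}_0)$, with
$\mathcal{F}_0 = \{x \in \Omega = \Delta^{\mathbb{Z}} \;|\; x_0 = r \in \Delta\}$.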



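Theorem 2.2.2 (First Borel-Cantelli lemma). Given a sequence of events
$A_n \in \mathcal{A}$. Then

$$\sum_{n \in \mathbb{N}} P[A_n] < \infty \Rightarrow P[A_\infty] = 0 .$$



Proof. $P[A_\infty] = \lim_{n \to \infty} P[\bigcup_{k \ge n} A_k] \le \lim_{n \to \infty} \sum_{k \ge n} P[A_k] = 0$. □



Theorem 2.2.3 (Second Borel-Cantelli lemma). For a sequence $A_n \in \mathcal{A}$ of
independent events,

$$\sum_{n \in \mathbb{N}} P[A_n] = \infty \Rightarrow P[A_\infty] = 1 .$$



Proof. For every integer $n \in \mathbb{N}$,

$$P[\bigcap_{k \ge n} A_k^c] = \prod_{k \ge n} P[A_k^c] = \prod_{k \ge n} (1 - P[A_k]) \le \prod_{k \ge n} \exp(-P[A_k]) = \exp(-\sum_{k \ge n} P[A_k]) .$$

The right hand side is 0 because $\sum_{k \ge n} P[A_k] = \infty$ for every $n$. From

$$P[A_\infty^c] = P[\bigcup_{n \in \mathbb{N}} \bigcap_{k \ge n} A_k^c] \le \sum_{n \in \mathbb{N}} P[\bigcap_{k \ge n} A_k^c] = 0$$

follows $P[A_\infty^c] = 0$.

□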
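
The key estimate in the proof of the second Borel-Cantelli lemma is the elementary bound $1 - x \le e^{-x}$ applied factor by factor. The following sketch illustrates it numerically for the hypothetical choice $P[A_k] = 1/(k+1)$, whose sum diverges; the function name and the truncation at a finite index `m` are assumptions made for illustration.

```python
import math

def tail_bounds(n, m):
    """For p_k = 1/(k+1), return the pair
    (prod_{k=n}^{m} (1 - p_k), exp(-sum_{k=n}^{m} p_k))."""
    ps = [1.0 / (k + 1) for k in range(n, m + 1)]
    product = math.prod(1.0 - p for p in ps)
    bound = math.exp(-sum(ps))
    return product, bound

for m in (10, 100, 10_000):
    product, bound = tail_bounds(1, m)
    # the product never exceeds the exponential bound, and both tend to 0
    print(m, product, bound)
```

For this choice the product telescopes to $1/(m+1)$, so both columns visibly tend to zero as `m` grows.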



Example. The following example illustrates that independence is necessary
in the second Borel-Cantelli lemma: take the probability space $([0, 1], \mathcal{B}, P)$,
where $P = dx$ is the Lebesgue measure on the Borel $\sigma$-algebra $\mathcal{B}$ of $[0, 1]$.
For $A_n = [0, 1/n]$ we get $A_\infty = \{0\}$ and so $P[A_\infty] = 0$. But because $P[A_n] =
1/n$ we have $\sum_{n=1}^\infty P[A_n] = \sum_{n=1}^\infty \frac{1}{n} = \infty$ because the harmonic series
$\sum_{n=1}^\infty 1/n$ diverges.



Example. ("Monkey typing Shakespeare") Writing a novel amounts to entering
a sequence of $N$ symbols into a computer. For example, to write "Hamlet",
Shakespeare had to enter $N = 180'000$ characters. A monkey is placed
in front of a terminal and types symbols at random, one per unit time,
producing a sequence $X_n$ of independent, identically distributed random
variables in the set of all possible symbols. If each letter occurs with
probability at least $\epsilon$, then the probability that Hamlet appears when typing
the first $N$ letters is at least $\epsilon^N$. Call $A_1$ this event and call $A_k$ the event that
this happens when typing the $((k-1)N + 1)$'th until the $kN$'th letter. These
sets $A_k$ are all independent and have all equal probability. By the second
Borel-Cantelli lemma, the events occur infinitely often. This means that
Shakespeare's work is not only written once, but infinitely many times.
Before we model this precisely, let's look at the odds for random typing. There
are $30^N$ possibilities to write a word of length $N$ with 26 letters together
with a minimal set of punctuation: a space, a comma, a dash and a period
sign. The chance to write "To be, or not to be - that is the question."
with 43 random hits onto the keyboard is $1/10^{63.5}$. Note that the life time
of a monkey is bounded above by $131400000 \approx 10^8$ seconds so that it is
even unlikely that this single sentence will ever be typed. To compare the
probability, it is helpful to put the result into a list of known large numbers.



$10^4$ One "myriad". The largest number the Greeks were considering.
$10^5$ The largest number considered by the Romans.
$10^{10}$ The age of the universe in years.
$10^{17}$ The age of the universe in seconds.
$10^{22}$ Distance to our neighbor galaxy Andromeda in meters.
$10^{23}$ Number of atoms in two gram Carbon which is 1 Avogadro.
$10^{27}$ Estimated size of universe in meters.
$10^{30}$ Mass of the sun in kilograms.
$10^{41}$ Mass of our home galaxy "milky way" in kilograms.
$10^{?}$ Archimedes's estimate of number of sand grains in universe.
$10^{80}$ The number of protons in the universe.
$10^{100}$ One "googol". (Name coined by 9 year old nephew of E. Kasner).
$10^{153}$ Number mentioned in a myth about Buddha.
$10^{15?}$ Size of ninth Fermat number (factored in 1990).
$10^{10^{?}}$ Size of large prime number (Mersenne number, Nov 1996).
$10^{10^{?}}$ Years, ape needs to write "hound of Baskerville" (random typing).
$10^{10^{?}}$ Inverse is chance that a can of beer tips by quantum fluctuation.
$10^{10^{?}}$ Inverse is probability that a mouse survives on the sun for a week.
$10^{10^{50}}$ Estimated number of possible games of chess.
$10^{10^{?}}$ Inverse is chance to find yourself on Mars by quantum fluctuations.
$10^{10^{100}}$ One "googolplex".
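
The odds quoted in the monkey example are easy to recompute. This is a minimal sketch, assuming an alphabet of 30 equally likely symbols and ignoring the difference between upper and lower case letters; the variable names are illustrative.

```python
import math

alphabet_size = 30      # 26 letters plus space, comma, dash and period
sentence = "To be, or not to be - that is the question."
n = len(sentence)       # number of key strokes needed

# log10 of the number of strings of length n over the alphabet,
# so the chance of typing the sentence is 10 ** (-log10_count)
log10_count = n * math.log10(alphabet_size)
print(n, round(log10_count, 1))
```

Working with the logarithm avoids forming the huge integer explicitly and makes the comparison with the entries of the list direct.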



Lemma 2.2.4. Given a random variable $X$ on a finite probability space $\Delta$,
there exists a sequence $X_1, X_2, \dots$ of independent random variables for
which all random variables $X_i$ have the same distribution as $X$.



Proof. The product space $\Omega = \Delta^{\mathbb{N}}$ is compact by Tychonov's theorem. Let
$\mathcal{A}$ be the Borel $\sigma$-algebra on $\Omega$ and let $Q$ denote the probability measure on
$\Delta$. The probability measure $P = Q^{\mathbb{N}}$ defined on $(\Omega, \mathcal{A})$ has the property
that for any cylinder set

$$Z(w) = \{\omega \in \Omega \;|\; \omega_k = r_k, \omega_{k+1} = r_{k+1}, \dots, \omega_n = r_n\}$$

defined by a "word" $w = [r_k, \dots, r_n]$,

$$P[Z(w)] = \prod_{i=k}^n P[\omega_i = r_i] = \prod_{i=k}^n Q(\{r_i\}) .$$

Finite unions of cylinder sets form an algebra $\mathcal{R}$ which generates $\sigma(\mathcal{R}) = \mathcal{A}$.
The measure $P$ is $\sigma$-additive on this algebra. By Carathéodory's continuation
theorem (2.1.6), there exists a measure $P$ on $(\Omega, \mathcal{A})$. For this probability
space $(\Omega, \mathcal{A}, P)$, the random variables $X_i(\omega) = \omega_i$ are independent
and have the same distribution as $X$. □



Remark. The proof made use of Tychonov's theorem which tells that the
product of compact topological spaces is compact. The theorem is equivalent
to the axiom of choice, one of the fundamental assumptions of
mathematics, so that we can assume it to be a fundamental axiom itself. The
compactness of a countable product of compact metric spaces which was
needed in the proof could be proven without the axiom using a diagonal
argument. It was easier to just refer to a fundamental assumption of
mathematics.

Example. In the example of the monkey writing a novel, the process of
authoring is given by a sequence of independent random variables $X_n(\omega) =
\omega_n$. The event that Hamlet is written during the time $[Nk + 1, N(k + 1)]$
is given by a cylinder set $A_k$. They have all the same probability. By the
second Borel-Cantelli lemma, $P[A_\infty] = 1$. The set $A_\infty$, the event that the
monkey types this novel arbitrarily often, has probability 1.

Remark. Lemma (2.2.4) can be generalized: given any sequence of probability
spaces $(\mathbb{R}, \mathcal{B}, P_i)$ one can form the product space $(\Omega, \mathcal{A}, P)$. The
random variables $X_i(\omega) = \omega_i$ are independent and have the law $P_i$. Another
construction of independent random variables is given in [109].



Exercise. In this exercise, we experiment with some measures on $\Omega = \mathbb{N}$
[113].

a) The distance $d(n, m) = |n - m|$ defines a topology $\mathcal{O}$ on $\Omega = \mathbb{N}$. What
is the Borel $\sigma$-algebra $\mathcal{A}$ generated by this topology?

b) Show that for every $\lambda > 0$

$$P[A] = \sum_{n \in A} e^{-\lambda} \frac{\lambda^n}{n!}$$

is a probability measure on the measurable space $(\Omega, \mathcal{A})$ considered in a).

c) Show that for every $s > 1$

$$P[A] = \sum_{n \in A} \zeta(s)^{-1} n^{-s}$$

is a probability measure on the measurable space $(\Omega, \mathcal{A})$. The function

$$s \mapsto \zeta(s) = \sum_{n \in \Omega} \frac{1}{n^s}$$

is called the Riemann zeta function.

d) Show that the sets $A_p = \{n \in \Omega \;|\; p \text{ divides } n\}$ with prime $p$ are independent.
What happens if $p$ is not prime?

e) Give a probabilistic proof of Euler's formula

$$\frac{1}{\zeta(s)} = \prod_{p \text{ prime}} (1 - \frac{1}{p^s}) .$$

f) Let $A$ be the set of natural numbers which are not divisible by a square
different from 1. Prove $P[A] = 1/\zeta(2s)$.
2.3 Integration, Expectation, Variance

In this entire section, $(\Omega, \mathcal{A}, P)$ will denote a fixed probability space.



Definition. A statement $S$ about points $\omega \in \Omega$ is a map from $\Omega$ to $\{$true, false$\}$.
A statement is said to hold almost everywhere, if $P[\{\omega \;|\; S(\omega) =
\text{false}\}] = 0$. For example, the statement "let $X_n \to X$ almost everywhere",
is a short hand notation for the statement that the set $\{x \in \Omega \;|\; X_n(x) \to
X(x)\}$ is measurable and has measure 1.

Definition. The algebra of all random variables is denoted by $\mathcal{L}$. It is a
vector space over the field $\mathbb{R}$ of the real numbers in which one can multiply.
An elementary function or step function is an element of $\mathcal{L}$ which is of the
form

$$X = \sum_{i=1}^n \alpha_i \cdot 1_{A_i}$$

with $\alpha_i \in \mathbb{R}$ and where $A_i \in \mathcal{A}$ are disjoint sets. Denote by $\mathcal{S}$ the algebra
of step functions. For $X \in \mathcal{S}$ we can define the integral

$$E[X] := \int X \, dP = \sum_{i=1}^n \alpha_i P[A_i] .$$

Definition. Define $\mathcal{L}^1 \subset \mathcal{L}$ as the set of random variables $X$, for which

$$\sup_{Y \in \mathcal{S}, Y \le |X|} \int Y \, dP$$

is finite. For $X \in \mathcal{L}^1$, we can define the integral or expectation

$$E[X] := \int X \, dP = \sup_{Y \in \mathcal{S}, Y \le X^+} \int Y \, dP - \sup_{Y \in \mathcal{S}, Y \le X^-} \int Y \, dP ,$$

where $X^+ = X \vee 0 = \max(X, 0)$ and $X^- = -X \vee 0 = \max(-X, 0)$. The
vector space $\mathcal{L}^1$ is called the space of integrable random variables. Similarly,
for $p \ge 1$ write $\mathcal{L}^p$ for the set of random variables $X$ for which $E[|X|^p] < \infty$.

Definition. It is customary to write $L^1$ for the space $\mathcal{L}^1$, where random variables
$X, Y$ for which $E[|X - Y|] = 0$ are identified. Unlike $\mathcal{L}^p$, the spaces
$L^p$ are Banach spaces. We will come back to this later.

Definition. For $X \in \mathcal{L}^2$, we can define the variance

$$Var[X] := E[(X - E[X])^2] = E[X^2] - E[X]^2 .$$

The nonnegative number

$$\sigma[X] = Var[X]^{1/2}$$

is called the standard deviation of $X$.

The names expectation and standard deviation pretty much describe already
the meaning of these numbers. The expectation is the "average",
"mean" or "expected" value of the variable and the standard deviation
measures how much we can expect the variable to deviate from the mean.
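
For a step function the definitions reduce to finite sums. The sketch below evaluates them for the number shown by a fair die, treated as a step function with values 1 to 6 on six disjoint sets of probability 1/6; the function names are hypothetical.

```python
from fractions import Fraction

def expectation(values, probs):
    """E[X] = sum_i a_i P[A_i] for a step function with values a_i on disjoint sets A_i."""
    return sum(a * p for a, p in zip(values, probs))

def variance(values, probs):
    """Var[X] = E[X^2] - E[X]^2."""
    mean = expectation(values, probs)
    return expectation([a * a for a in values], probs) - mean ** 2

values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
print(expectation(values, probs), variance(values, probs))
```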



Example. The m'th power random variable $X(x) = x^m$ on $([0, 1], \mathcal{B}, P)$ has
the expectation

$$E[X] = \int_0^1 x^m \, dx = \frac{1}{m + 1} ,$$

the variance

$$Var[X] = E[X^2] - E[X]^2 = \frac{1}{2m + 1} - \frac{1}{(m + 1)^2} = \frac{m^2}{(1 + m)^2 (1 + 2m)}$$

and the standard deviation $\sigma[X] = \frac{m}{(1 + m)\sqrt{1 + 2m}}$. Both the expectation
as well as the standard deviation converge to 0 if $m \to \infty$.
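
The closed form for the variance can be confirmed with exact rational arithmetic. This is a small sketch that uses only the integrals of $x^m$ and $x^{2m}$ over $[0, 1]$; `moments` is a hypothetical helper name.

```python
from fractions import Fraction

def moments(m):
    """Mean and variance of X(x) = x^m on [0, 1] with Lebesgue measure."""
    mean = Fraction(1, m + 1)            # integral of x^m over [0, 1]
    second = Fraction(1, 2 * m + 1)      # integral of x^(2m) over [0, 1]
    return mean, second - mean ** 2

for m in (1, 2, 5, 50):
    mean, var = moments(m)
    closed_form = Fraction(m * m, (1 + m) ** 2 * (1 + 2 * m))
    print(m, mean, var, var == closed_form)
```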

Definition. If $X$ is a random variable, then $E[X^m]$ is called the m'th moment
of $X$. The m'th central moment of $X$ is defined as $E[(X - E[X])^m]$.

Definition. The moment generating function of $X$ is defined as $M_X(t) =
E[e^{tX}]$. The moment generating function often allows a fast simultaneous
computation of all the moments. The function

$$\kappa_X(t) = \log(M_X(t))$$

is called the cumulant generating function.

Example. For $X(x) = x$ on $[0, 1]$ we have both

$$M_X(t) = \int_0^1 e^{tx} \, dx = \frac{e^t - 1}{t} = \sum_{m=1}^\infty \frac{t^{m-1}}{m!} = \sum_{m=0}^\infty \frac{t^m}{(m + 1)!}$$

and

$$M_X(t) = E[e^{tX}] = E[\sum_{m=0}^\infty \frac{t^m X^m}{m!}] = \sum_{m=0}^\infty t^m \frac{E[X^m]}{m!} .$$

Comparing coefficients shows $E[X^m] = 1/(m + 1)$.

Example. Let $\Omega = \mathbb{R}$. For given $m \in \mathbb{R}$, $\sigma > 0$, define the probability
measure $P[[a, b]] = \int_a^b f(x) \, dx$ with

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - m)^2}{2 \sigma^2}} .$$

This is a probability measure because after a change of variables $y =
(x - m)/(\sqrt{2} \sigma)$, the integral $\int_{-\infty}^\infty f(x) \, dx$ becomes $\frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty e^{-y^2} \, dy = 1$. The
random variable $X(x) = x$ on $(\Omega, \mathcal{A}, P)$ is a random variable with Gaussian
distribution, mean $m$ and standard deviation $\sigma$. One simply calls it a Gaussian
random variable or random variable with normal distribution. Let's
justify the constants $m$ and $\sigma$: the expectation of $X$ is $E[X] = \int X \, dP =
\int_{-\infty}^\infty x f(x) \, dx = m$. The variance is $E[(X - m)^2] = \int_{-\infty}^\infty (x - m)^2 f(x) \, dx = \sigma^2$
so that the constant $\sigma$ is indeed the standard deviation. The moment generating
function of $X$ is $M_X(t) = e^{mt + \sigma^2 t^2/2}$. The cumulant generating
function is therefore $\kappa_X(t) = mt + \sigma^2 t^2/2$.

Example. If $X$ is a Gaussian random variable with mean $m = 0$ and
standard deviation $\sigma$, then the random variable $Y = e^X$ has the mean
$E[Y] = E[e^X] = e^{\sigma^2/2}$. Proof:

$$\frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^\infty e^{y - \frac{y^2}{2 \sigma^2}} \, dy = e^{\sigma^2/2} \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^\infty e^{-\frac{(y - \sigma^2)^2}{2 \sigma^2}} \, dy = e^{\sigma^2/2} .$$

The random variable $Y$ has the log normal distribution.

Example. A random variable $X \in \mathcal{L}^2$ with standard deviation $\sigma = 0$ is a
constant random variable. It satisfies $X(\omega) = m$ for almost all $\omega \in \Omega$.

Definition. If $X \in \mathcal{L}^2$ is a random variable with mean $m$ and standard
deviation $\sigma$, then the random variable $Y = (X - m)/\sigma$ has the mean $m = 0$
and standard deviation $\sigma = 1$. Such a random variable is called normalized.
One often only adjusts the mean and calls $X - E[X]$ the centered random
variable.
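
The mean of the log normal distribution can be checked by numerical integration. The following is a rough sketch using the midpoint rule on a truncated interval; the truncation width and the number of steps are arbitrary choices made here for illustration.

```python
import math

def lognormal_mean(sigma, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of E[e^X] for X Gaussian with mean 0 and
    standard deviation sigma, integrating over [-half_width*sigma, half_width*sigma]."""
    a, b = -half_width * sigma, half_width * sigma
    h = (b - a) / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for i in range(steps):
        y = a + (i + 0.5) * h
        total += math.exp(y - y * y / (2.0 * sigma ** 2))
    return norm * h * total

for sigma in (0.5, 1.0, 2.0):
    # numerical value next to the closed form e^(sigma^2 / 2)
    print(sigma, lognormal_mean(sigma), math.exp(sigma ** 2 / 2.0))
```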



Exercise. The Rademacher functions $r_n(x)$ are real-valued functions on
$[0, 1]$ defined by

$$r_n(x) = \begin{cases} 1 & \frac{2k-2}{2^n} \le x < \frac{2k-1}{2^n} \\ -1 & \frac{2k-1}{2^n} \le x < \frac{2k}{2^n} \end{cases} , \qquad k = 1, \dots, 2^{n-1} .$$

They are random variables on the Lebesgue space $([0, 1], \mathcal{A}, P = dx)$.

a) Show that $1 - 2x = \sum_{n=1}^\infty \frac{r_n(x)}{2^n}$. This means that for fixed $x$, the sequence
$r_n(x)$ is the binary expansion of $1 - 2x$.

b) Verify that $r_n(x) = \text{sign}(\sin(2 \pi 2^{n-1} x))$ for almost all $x$.

c) Show that the random variables $r_n(x)$ on $[0, 1]$ are IID random variables
with uniform distribution on $\{-1, 1\}$.

d) Each $r_n(x)$ has the mean $E[r_n] = 0$ and the variance $Var[r_n] = 1$.

Figure. The Rademacher function $r_1(x)$, followed by two further plots of
Rademacher functions.
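
Parts a) and b) of the Rademacher exercise can be explored numerically. The sketch below assumes the convention $r_n(x) = 1 - 2 b_n$, where $b_n$ is the n'th binary digit of $x$, which is the form consistent with both a) and b); the function names are hypothetical.

```python
import math

def rademacher(n, x):
    """r_n(x) = 1 - 2 * (n-th binary digit of x), for x in [0, 1)."""
    digit = int(x * 2 ** n) % 2
    return 1 - 2 * digit

def partial_sum(x, terms=50):
    """Partial sum of the series sum_n r_n(x) / 2^n."""
    return sum(rademacher(n, x) / 2 ** n for n in range(1, terms + 1))

x = 0.3
print(partial_sum(x), 1 - 2 * x)                       # part a)
signs = [int(math.copysign(1, math.sin(2 * math.pi * 2 ** (n - 1) * x)))
         for n in (1, 2, 3)]
print([rademacher(n, x) for n in (1, 2, 3)], signs)    # part b)
```

The sign formula is only compared at a point where the sine does not vanish, in line with the "almost all x" in b).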



Exercise. Given any 0-1 data of length $n$. Let $k$ be the number of ones. If
$p = k/n$ is the mean, verify that we can compute the variance of the data
as $p(1 - p)$. A statistician would prove it as follows:

$$\frac{1}{n} \sum_{i=1}^n (x_i - p)^2 = \frac{1}{n} (k(1 - p)^2 + (n - k)(0 - p)^2)$$

$$= (k - 2kp + np^2)/n = p - 2p^2 + p^2 = p - p^2 = p(1 - p) .$$

Give a shorter proof of this using $E[X^2] = E[X]$ and the formulas for
$Var[X]$.
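
The identity for 0-1 data can be tried on a concrete sample. This is a minimal sketch with a made-up data vector; any list of zeros and ones would do.

```python
from fractions import Fraction

data = [1, 0, 0, 1, 1, 0, 1, 1]       # hypothetical 0-1 sample
n, k = len(data), sum(data)
p = Fraction(k, n)                     # mean of the data

# variance computed directly from the definition (1/n) * sum (x_i - p)^2
variance = sum((x - p) ** 2 for x in data) / n
print(p, variance, variance == p * (1 - p))
```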



2.4 Results from real analysis 

In this section we recall some results of real analysis with their proofs.
In the measure theory or real analysis literature, it is customary to write
$\int f(x) \, d\mu(x)$ instead of $E[X]$ or $f, g, h, \dots$ instead of $X, Y, Z, \dots$, but this
is just a change of vocabulary. What is special about probability theory is
that the measures $\mu$ are probability measures and so finite.



Theorem 2.4.1 (Monotone convergence theorem, Beppo Levi 1906). Let $X_n$
be a sequence of random variables in $\mathcal{L}^1$ with $0 \le X_1 \le X_2, \dots$ and assume
$X = \lim_{n \to \infty} X_n$ converges pointwise. If $\sup_n E[X_n] < \infty$, then $X \in \mathcal{L}^1$
and

$$E[X] = \lim_{n \to \infty} E[X_n] .$$

Proof. Because we can replace $X_n$ by $X_n - X_1$, we can assume $X_n \ge 0$.
Find for each $n$ a monotone sequence of step functions $X_{n,m} \in \mathcal{S}$ with
$X_n = \sup_m X_{n,m}$. Consider the sequence of step functions

$$Y_n := \sup_{1 \le k \le n} X_{k,n} \le \sup_{1 \le k \le n} X_{k,n+1} \le \sup_{1 \le k \le n+1} X_{k,n+1} = Y_{n+1} .$$

Since $Y_n \le \sup_{m=1}^n X_m = X_n$ also $E[Y_n] \le E[X_n]$. One checks that
$\sup_n Y_n = X$ implies $\sup_n E[Y_n] = \sup_{Y \in \mathcal{S}, Y \le X} E[Y]$ and concludes

$$E[X] = \sup_{Y \in \mathcal{S}, Y \le X} E[Y] = \sup_n E[Y_n] \le \sup_n E[X_n] \le E[\sup_n X_n] = E[X] .$$

We have used the monotonicity $E[X_n] \le E[X_{n+1}]$ in $\sup_n E[X_n] = E[X]$.

□



Theorem 2.4.2 (Fatou lemma, 1906). Let $X_n$ be a sequence of random
variables in $\mathcal{L}^1$ with $|X_n| \le X$ for some $X \in \mathcal{L}^1$. Then

$$E[\liminf_{n \to \infty} X_n] \le \liminf_{n \to \infty} E[X_n] \le \limsup_{n \to \infty} E[X_n] \le E[\limsup_{n \to \infty} X_n] .$$



Proof. For $p \ge n$, we have

$$\inf_{m \ge n} X_m \le X_p \le \sup_{m \ge n} X_m .$$

Therefore

$$E[\inf_{m \ge n} X_m] \le E[X_p] \le E[\sup_{m \ge n} X_m] .$$

Because $p \ge n$ was arbitrary, we have also

$$E[\inf_{m \ge n} X_m] \le \inf_{p \ge n} E[X_p] \le \sup_{p \ge n} E[X_p] \le E[\sup_{m \ge n} X_m] .$$

Since $Y_n = \inf_{m \ge n} X_m$ is increasing with $\sup_n E[Y_n] < \infty$ and $Z_n =
\sup_{m \ge n} X_m$ is decreasing with $\inf_n E[Z_n] > -\infty$ we get from the Beppo Levi
theorem (2.4.1) that $Y = \sup_n Y_n = \liminf_n X_n$ and $Z = \inf_n Z_n =
\limsup_n X_n$ are in $\mathcal{L}^1$ and

$$E[\liminf_n X_n] = \sup_n E[\inf_{m \ge n} X_m] \le \sup_n \inf_{m \ge n} E[X_m] = \liminf_n E[X_n]$$

$$\le \limsup_n E[X_n] = \inf_n \sup_{m \ge n} E[X_m]$$

$$\le \inf_n E[\sup_{m \ge n} X_m] = E[\limsup_n X_n] .$$

□

Theorem 2.4.3 (Lebesgue's dominated convergence theorem, 1902). Let $X_n$
be a sequence in $\mathcal{L}^1$ with $|X_n| \le Y$ for some $Y \in \mathcal{L}^1$. If $X_n \to X$ almost
everywhere, then $E[X_n] \to E[X]$.



Proof. Since $X = \liminf_n X_n = \limsup_n X_n$ we know that $X \in \mathcal{L}^1$ and
from Fatou lemma (2.4.2)

$$E[X] = E[\liminf_n X_n] \le \liminf_n E[X_n] \le \limsup_n E[X_n] \le E[\limsup_n X_n] = E[X] .$$

□

A special case of Lebesgue's dominated convergence theorem is when $Y =
K$ is constant. The theorem is then called the bounded dominated convergence
theorem. It says that $E[X_n] \to E[X]$ if $|X_n| \le K$ and $X_n \to X$
almost everywhere.

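Definition. Define also for $p \in [1, \infty)$ the vector spaces $\mathcal{L}^p = \{X \in \mathcal{L} \;|\; |X|^p \in
\mathcal{L}^1\}$ and $\mathcal{L}^\infty = \{X \in \mathcal{L} \;|\; \exists K \in \mathbb{R}, \; |X| \le K \text{ almost everywhere}\}$.



Example. For $\Omega = [0, 1]$ with the Lebesgue measure $P = dx$ and Borel
$\sigma$-algebra $\mathcal{A}$, look at the random variable $X(x) = x^\alpha$, where $\alpha$ is a real
number. Because $X$ is bounded for $\alpha \ge 0$, we have then $X \in \mathcal{L}^\infty$. For
$\alpha < 0$, the integral $E[|X|^p] = \int_0^1 x^{\alpha p} \, dx$ is finite if and only if $\alpha p > -1$ so
that $X$ is in $\mathcal{L}^p$ whenever $p < -1/\alpha$.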



2.5 Some inequalities 

Definition. A function $h : \mathbb{R} \to \mathbb{R}$ is called convex, if there exists for all
$x_0 \in \mathbb{R}$ a linear map $l(x) = ax + b$ such that $l(x_0) = h(x_0)$ and for all $x \in \mathbb{R}$
the inequality $l(x) \le h(x)$ holds.



Example. $h(x) = x^2$ is convex, $h(x) = e^x$ is convex, $h(x) = x$ is convex.
$h(x) = -x^2$ is not convex, $h(x) = x^3$ is not convex on $\mathbb{R}$ but convex on
$\mathbb{R}^+ = [0, \infty)$.



Figure. The Jensen inequality in the case $\Omega = \{u, v\}$, $P[\{u\}] =
P[\{v\}] = 1/2$ and with $X(u) = a$, $X(v) = b$, so that $E[X] = (a + b)/2$.
The function $h$ in this picture is a quadratic function of the form
$h(x) = (x - s)^2 + t$.



Theorem 2.5.1 (Jensen inequality). Given $X \in \mathcal{L}^1$. For any convex function
$h : \mathbb{R} \to \mathbb{R}$, we have

$$E[h(X)] \ge h(E[X]) ,$$
where the left hand side can also be infinite.



Proof. Let $l$ be the linear map defined at $x_0 = E[X]$. By the linearity and
monotonicity of the expectation, we get

$$h(E[X]) = l(E[X]) = E[l(X)] \le E[h(X)] .$$

□
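As a small numerical illustration (a sketch, not part of the original notes), one can check Jensen's inequality on a finite probability space. The values, probabilities and the helper names `expectation` and `jensen_gap` below are hypothetical choices made only for this example.

```python
import math

def expectation(values, probs):
    """E[X] for a random variable on a finite probability space."""
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical example: X takes these values with these probabilities.
values = [-1.0, 0.5, 2.0, 3.0]
probs = [0.1, 0.4, 0.3, 0.2]

# Three convex functions h.
convex_functions = {
    "x^2": lambda x: x * x,
    "exp": math.exp,
    "abs": abs,
}

def jensen_gap(h):
    """E[h(X)] - h(E[X]); Jensen's inequality says this is nonnegative."""
    return expectation([h(v) for v in values], probs) - h(expectation(values, probs))

for name, h in convex_functions.items():
    print(f"{name}: E[h(X)] - h(E[X]) = {jensen_gap(h):.4f}")
```

For a non-convex function such as $h(x) = -x^2$ the same gap becomes negative, which is one way to see that convexity is needed.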

Example. Given $p \le q$. Define $h(x) = |x|^{q/p}$. Jensen's inequality gives
$E[|X|^q] = E[h(|X|^p)] \ge h(E[|X|^p]) = E[|X|^p]^{q/p}$. This implies that $\|X\|_q :=
E[|X|^q]^{1/q} \ge E[|X|^p]^{1/p} = \|X\|_p$ for $p \le q$ and so

$$\mathcal{L}^\infty \subset \mathcal{L}^q \subset \mathcal{L}^p \subset \mathcal{L}^1$$

for $p \le q$. The smallest space is $\mathcal{L}^\infty$ which is the space of all bounded
random variables.



Exercise. Assume $X$ is a nonnegative random variable for which $X$ and
$1/X$ are both in $\mathcal{L}^1$. Show that $E[X + 1/X] \ge 2$.



We have defined $\mathcal{L}^p$ as the set of random variables which satisfy $E[|X|^p] <
\infty$ for $p \in [1, \infty)$ and $|X| \le K$ almost everywhere for $p = \infty$. The vector
space $\mathcal{L}^p$ has the semi-norm $\|X\|_p = E[|X|^p]^{1/p}$, respectively $\|X\|_\infty = \inf\{ K \in
\mathbb{R} \mid |X| \le K \text{ almost everywhere} \}$.

Definition. One can construct from $\mathcal{L}^p$ a real Banach space $L^p = \mathcal{L}^p / \mathcal{N}$
which is the quotient of $\mathcal{L}^p$ with $\mathcal{N} = \{ X \in \mathcal{L}^p \mid \|X\|_p = 0 \}$. Without this
identification, one only has a pre-Banach space in which the property that
only the zero element has norm zero is not necessarily true. Especially for
$p = 2$, the space $L^2$ is a real Hilbert space with inner product $\langle X, Y \rangle =
E[XY]$.

Example. The function $f(x) = 1_{\mathbb{Q}}(x)$ which assigns the value 1 to rational
numbers $x$ on $[0, 1]$ and the value 0 to irrational numbers is different from
the constant function $g(x) = 0$ in $\mathcal{L}^p$. But in $L^p$, we have $f = g$.

The finiteness of the inner product follows from the following inequality:



Theorem 2.5.2 (Hölder inequality, Hölder 1889). Given $p, q \in [1, \infty]$ with
$p^{-1} + q^{-1} = 1$ and $X \in \mathcal{L}^p$ and $Y \in \mathcal{L}^q$. Then $XY \in \mathcal{L}^1$ and

$$\|XY\|_1 \le \|X\|_p \|Y\|_q .$$



Proof. The random variables $X, Y$ are defined over a probability space
$(\Omega, \mathcal{A}, P)$. We will use that $p^{-1} + q^{-1} = 1$ is equivalent to $q + p = pq$ or
$q(p - 1) = p$. Without loss of generality we can restrict ourselves to $X, Y \ge 0$
because replacing $X$ with $|X|$ and $Y$ with $|Y|$ does not change anything.
We can also assume $\|X\|_p > 0$ because otherwise $X = 0$ almost everywhere
and both sides are zero. We can therefore write $X$ instead of $|X|$ and assume $X$ is not
zero. The key idea of the proof is to introduce a new probability measure

$$Q = \frac{X^p P}{E[X^p]} .$$

If $P[A] = \int_A 1 \, dP(x)$ then $Q[A] = [\int_A X^p(x) \, dP(x)] / E[X^p]$ so that $Q[\Omega] =
E[X^p]/E[X^p] = 1$ and $Q$ is a probability measure. Let us denote the
expectation with respect to this new measure with $E_Q$. We define the new
random variable $U = 1_{\{X > 0\}} Y / X^{p-1}$. Jensen's inequality applied to the
convex function $h(x) = x^q$ gives

$$E_Q[U]^q \le E_Q[U^q] . \qquad (2.4)$$

Using

$$E_Q[U] = E_Q\left[\frac{Y}{X^{p-1}}\right] = \frac{E[XY]}{E[X^p]}$$

and

$$E_Q[U^q] = E_Q\left[\frac{Y^q}{X^{q(p-1)}}\right] = E_Q\left[\frac{Y^q}{X^p}\right] \le \frac{E[Y^q]}{E[X^p]} ,$$

where the fractions are understood on the set $\{X > 0\}$, Equation (2.4)
can be rewritten as

$$\frac{E[XY]^q}{E[X^p]^q} \le \frac{E[Y^q]}{E[X^p]}$$

which implies

$$E[XY] \le E[Y^q]^{1/q} E[X^p]^{1 - 1/q} = E[Y^q]^{1/q} E[X^p]^{1/p} .$$

The last equation rewrites the claim $\|XY\|_1 \le \|X\|_p \|Y\|_q$ in different
notation. □



A special case of Hölder's inequality is the Cauchy-Schwarz inequality

$$\|XY\|_1 \le \|X\|_2 \cdot \|Y\|_2 .$$

The semi-norm property of $\mathcal{L}^p$ follows from the following fact:



Theorem 2.5.3 (Minkowski inequality (1896)). Given $p \in [1, \infty]$ and $X, Y \in
\mathcal{L}^p$. Then

$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p .$$



Proof. We use Hölder's inequality (2.5.2) to get

$$E[|X + Y|^p] \le E[|X| |X + Y|^{p-1}] + E[|Y| |X + Y|^{p-1}] \le \|X\|_p C + \|Y\|_p C ,$$
where $C = \| |X + Y|^{p-1} \|_q = E[|X + Y|^p]^{1/q}$. Dividing by $C$ and using
$1 - 1/q = 1/p$ leads to the claim. □



Definition. We use the short-hand notation $P[X \ge c]$ for $P[\{ \omega \in \Omega \mid X(\omega) \ge
c \}]$.



Theorem 2.5.4 (Chebychev-Markov inequality). Let $h$ be a monotone function
on $\mathbb{R}$ with $h \ge 0$. For every $c > 0$, and $h(X) \in \mathcal{L}^1$ we have

$$h(c) \cdot P[X \ge c] \le E[h(X)] .$$



Proof. Integrate the inequality $h(c) 1_{\{X \ge c\}} \le h(X)$ and use the monotonicity
and linearity of the expectation. □

Figure. The proof of the Chebychev-Markov inequality in the case $h(x) = x$.
The left hand side $h(c) \cdot P[X \ge c]$ is the area of the rectangles
$\{X \ge c\} \times [0, h(c)]$ and $E[h(X)] = E[X]$ is the area under the graph of $X$.

Example. $h(x) = |x|$, applied to $|X|$, leads to $P[|X| \ge c] \le \|X\|_1 / c$ which implies for
example the statement

$$E[|X|] = 0 \Rightarrow P[X = 0] = 1 .$$



Exercise. Prove the Chernoff bound

$$P[X \ge c] \le \inf_{t \ge 0} e^{-tc} M_X(t) ,$$
where $M_X(t) = E[e^{Xt}]$ is the moment generating function of $X$.



An important special case of the Chebychev-Markov inequality is the Cheby- 
chev inequality: 



Theorem 2.5.5 (Chebychev inequality). If $X \in \mathcal{L}^2$, then

$$P[|X - E[X]| \ge c] \le \frac{\mathrm{Var}[X]}{c^2} .$$



Proof. Take $h(x) = x^2$, which is monotone on $[0, \infty)$, and apply the
Chebychev-Markov inequality to the random variable $|Y|$, where
$Y = X - E[X] \in \mathcal{L}^2$ satisfies $h(Y) \in \mathcal{L}^1$. □
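The Chebychev inequality can be checked numerically. The following is a minimal sketch, assuming we use the empirical distribution of a sample as the probability space (so mean and variance are computed with weight 1/n per sample point); the function name `chebychev_check` and the choice of exponentially distributed samples are illustrative assumptions.

```python
import random

def chebychev_check(samples, c):
    """Return (P[|X - E[X]| >= c], Var[X] / c^2) for the empirical
    distribution that puts mass 1/n on each sample point."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    tail = sum(1 for x in samples if abs(x - mean) >= c) / n
    return tail, var / c ** 2

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [rng.expovariate(1.0) for _ in range(10000)]

for c in (0.5, 1.0, 2.0, 3.0):
    tail, bound = chebychev_check(samples, c)
    print(f"c = {c}: tail probability {tail:.4f} <= bound {bound:.4f}")
```

Because the inequality holds for every probability measure, it holds exactly for the empirical measure, not only approximately. The bound is typically far from sharp, which is visible in the printed values.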

Definition. For $X, Y \in \mathcal{L}^2$ define the covariance

$$\mathrm{Cov}[X, Y] := E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y] .$$
Two random variables in $\mathcal{L}^2$ are called uncorrelated if $\mathrm{Cov}[X, Y] = 0$.

Example. We have $\mathrm{Cov}[X, X] = \mathrm{Var}[X] = E[(X - E[X])^2]$ for a random
variable $X \in \mathcal{L}^2$.

Remark. The Cauchy-Schwarz inequality can be restated in the form

$$|\mathrm{Cov}[X, Y]| \le \sigma[X] \sigma[Y] .$$

Definition. The regression line of two random variables $X, Y$ is defined as
$y = ax + b$, where

$$a = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]} , \qquad b = E[Y] - a E[X] .$$
If $\Omega = \{1, \dots, n\}$ is a finite set, then the random variables $X, Y$ define the
vectors

$$X = (X(1), \dots, X(n)), \quad Y = (Y(1), \dots, Y(n))$$

or $n$ data points $(X(i), Y(i))$ in the plane. As will follow from the proposition
below, the regression line has the property that it minimizes the sum
of the squares of the distances from these points to the line.



Figure. Regression line com- 
puted from a finite set of data 
points (X(i),Y(i)). 




Example. If $X, Y$ are independent, then $a = 0$. It follows that $b = E[Y]$.

Example. If $X = Y$, then $a = 1$ and $b = 0$. The best guess for $Y$ is $X$.



Proposition 2.5.6. If $y = ax + b$ is the regression line of $X, Y$, then the
random variable $\tilde{Y} = aX + b$ minimizes $\mathrm{Var}[Y - \tilde{Y}]$ under the constraint
$E[\tilde{Y}] = E[Y]$ and is the best guess for $Y$, when knowing only $E[Y]$ and
$\mathrm{Cov}[X, Y]$. We check $\mathrm{Cov}[X, Y] = \mathrm{Cov}[X, \tilde{Y}]$.



Proof. To minimize $\mathrm{Var}[aX + b - Y]$ under the constraint $E[aX + b - Y] = 0$ is
equivalent to finding $(a, b)$ which minimizes $f(a, b) = E[(aX + b - Y)^2]$ under
the constraint $g(a, b) = E[aX + b - Y] = 0$. This least square solution
can be obtained with the Lagrange multiplier method or by solving $b =
E[Y] - a E[X]$ and minimizing

$$h(a) = E[(aX - Y - E[aX - Y])^2] = a^2 \mathrm{Var}[X] - 2a \, \mathrm{Cov}[X, Y] + \mathrm{Var}[Y] ,$$

where $\mathrm{Var}[X] = E[X^2] - E[X]^2$ and $\mathrm{Cov}[X, Y] = E[XY] - E[X] E[Y]$. The
constant $\mathrm{Var}[Y]$ does not affect the minimizer. Setting $h'(a) = 0$
gives $a = \mathrm{Cov}[X, Y] / \mathrm{Var}[X]$. □
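For a finite set of data points, the formulas $a = \mathrm{Cov}[X, Y]/\mathrm{Var}[X]$ and $b = E[Y] - aE[X]$ from the proof can be evaluated directly. The sketch below assumes the uniform probability measure on $\Omega = \{1, \dots, n\}$; the function name `regression_line` and the sample data are hypothetical.

```python
def regression_line(xs, ys):
    """Slope a = Cov[X, Y] / Var[X] and intercept b = E[Y] - a E[X]
    for data points (xs[i], ys[i]), each carrying probability 1/n."""
    n = len(xs)
    ex = sum(xs) / n
    ey = sum(ys) / n
    cov = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n
    var = sum((x - ex) ** 2 for x in xs) / n
    a = cov / var  # requires Var[X] > 0
    b = ey - a * ex
    return a, b

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = regression_line(xs, ys)
print(a, b)
```

Calling `regression_line(xs, xs)` reproduces the example $X = Y$, where the slope is 1 and the intercept is 0.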

Definition. If the standard deviations $\sigma[X], \sigma[Y]$ are both different from
zero, then one can define the correlation coefficient

$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sigma[X] \sigma[Y]}$$

which is a number in $[-1, 1]$. Two random variables in $\mathcal{L}^2$ are called
uncorrelated if $\mathrm{Corr}[X, Y] = 0$. The other extreme is $|\mathrm{Corr}[X, Y]| = 1$, then
$Y = aX + b$ by the Cauchy-Schwarz inequality.



Theorem 2.5.7 (Pythagoras). If two random variables $X, Y \in \mathcal{L}^2$ are
independent, then $\mathrm{Cov}[X, Y] = 0$. If $X$ and $Y$ are uncorrelated, then
$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.



Proof. We can find monotone sequences of step functions

$$X_n = \sum_{i=1}^n \alpha_i 1_{A_i} \to X , \quad Y_n = \sum_{j=1}^n \beta_j 1_{B_j} \to Y .$$

We can choose these functions in such a way that $A_i \in \mathcal{A} = \sigma(X)$ and
$B_j \in \mathcal{B} = \sigma(Y)$. By the Lebesgue dominated convergence theorem (2.4.3),
$E[X_n] \to E[X]$ and $E[Y_n] \to E[Y]$. Compute $X_n \cdot
Y_n = \sum_{i,j=1}^n \alpha_i \beta_j 1_{A_i \cap B_j}$. By the Lebesgue dominated convergence
theorem (2.4.3) again, $E[X_n Y_n] \to E[XY]$. By the independence of $X, Y$ we
have $E[X_n Y_n] = E[X_n] \cdot E[Y_n]$ and so $E[XY] = E[X] E[Y]$ which implies
$\mathrm{Cov}[X, Y] = E[XY] - E[X] \cdot E[Y] = 0$.
The second statement follows from

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \, \mathrm{Cov}[X, Y] .$$

□ 

Remark. If $\Omega$ is a finite set, then the covariance $\mathrm{Cov}[X, Y]$ is the dot product
between the centered random variables $X - E[X]$ and $Y - E[Y]$, and
$\sigma[X]$ is the length of the vector $X - E[X]$ and the correlation coefficient
$\mathrm{Corr}[X, Y]$ is the cosine of the angle $\alpha$ between $X - E[X]$ and $Y - E[Y]$
because the dot product satisfies $v \cdot w = |v| |w| \cos(\alpha)$. So, uncorrelated
random variables $X, Y$ have the property that $X - E[X]$ is perpendicular
to $Y - E[Y]$. This geometric interpretation explains why theorem (2.5.7) is
called Pythagoras theorem. The statement $\mathrm{Var}[X - Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] -
2 \, \mathrm{Cov}[X, Y]$ is the law of cosines $c^2 = a^2 + b^2 - 2ab \cos(\alpha)$ in disguise if
$a, b, c$ are the side lengths of the triangle with vertices $0$, $X - E[X]$, $Y - E[Y]$.



For more inequalities in analysis, see the classic [30, 60]. We end this sec- 
tion with a list of properties of variance and covariance: 



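$\mathrm{Var}[X] \ge 0$.

$\mathrm{Var}[X] = E[X^2] - E[X]^2$.

$\mathrm{Var}[\lambda X] = \lambda^2 \mathrm{Var}[X]$.

$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \, \mathrm{Cov}[X, Y]$.

$\mathrm{Corr}[X, Y] \in [-1, 1]$.

$\mathrm{Cov}[X, Y] = E[XY] - E[X] E[Y]$.

$\mathrm{Cov}[X, Y] \le \sigma[X] \sigma[Y]$.

$\mathrm{Corr}[X, Y] = 1$ if $X - E[X] = Y - E[Y]$.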



2.6 The weak law of large numbers 

Consider a sequence $X_1, X_2, \dots$ of random variables on a probability space
$(\Omega, \mathcal{A}, P)$. We are interested in the asymptotic behavior of the sums $S_n =
X_1 + X_2 + \cdots + X_n$ for $n \to \infty$ and especially in the convergence of the
averages S n /n. The limiting behavior is described by "laws of large num- 
bers". Depending on the definition of convergence, one speaks of "weak" 
and "strong" laws of large numbers. 

We first prove the weak law of large numbers. There exist different ver- 
sions of this theorem since more assumptions on X n can allow stronger 
statements. 

Definition. A sequence of random variables $Y_n$ converges in probability to
a random variable $Y$, if for all $\epsilon > 0$,

$$\lim_{n \to \infty} P[|Y_n - Y| \ge \epsilon] = 0 .$$

One calls convergence in probability also stochastic convergence.

Remark. If for some $p \in [1, \infty)$, $\|X_n - X\|_p \to 0$, then $X_n \to X$ in
probability since by the Chebychev-Markov inequality (2.5.4), $P[|X_n - X| \ge
\epsilon] \le \|X - X_n\|_p^p / \epsilon^p$.



Exercise. Show that if two random variables $X, Y \in \mathcal{L}^2$ have non-zero
variance and satisfy $|\mathrm{Corr}(X, Y)| = 1$, then $Y = aX + b$ for some real
numbers $a, b$.

Theorem 2.6.1 (Weak law of large numbers for uncorrelated random variables).
Assume $X_i \in \mathcal{L}^2$ have common expectation $E[X_i] = m$ and satisfy
$\sup_n \frac{1}{n} \sum_{i=1}^n \mathrm{Var}[X_i] < \infty$. If $X_n$ are pairwise uncorrelated, then $S_n/n \to m$
in probability.



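Proof. Since $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \cdot \mathrm{Cov}[X, Y]$ and $X_n$ are
pairwise uncorrelated, we get $\mathrm{Var}[X_n + X_m] = \mathrm{Var}[X_n] + \mathrm{Var}[X_m]$ and by
induction $\mathrm{Var}[S_n] = \sum_{i=1}^n \mathrm{Var}[X_i]$. Using linearity, we obtain $E[S_n/n] = m$
and

$$\mathrm{Var}\left[\frac{S_n}{n}\right] = E\left[\frac{S_n^2}{n^2}\right] - \frac{E[S_n]^2}{n^2} = \frac{\mathrm{Var}[S_n]}{n^2} = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] .$$

The right hand side converges to zero for $n \to \infty$. With Chebychev's
inequality (2.5.5), we obtain

$$P\left[\left|\frac{S_n}{n} - m\right| \ge \epsilon\right] \le \frac{\mathrm{Var}[S_n/n]}{\epsilon^2} .$$

□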
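The estimate in the proof can be observed in a simulation. The following sketch assumes fair coin flips $X_i \in \{0, 1\}$ (so $m = 1/2$ and $\mathrm{Var}[X_i] = 1/4$) and compares the observed frequency of large deviations with the Chebychev bound $\mathrm{Var}[X_i]/(n \epsilon^2)$. The function name `deviation_frequency` and all parameters are illustrative choices.

```python
import random

def deviation_frequency(n, eps, trials, rng):
    """Estimate P[|S_n/n - 1/2| >= eps] for n fair coin flips
    by repeating the experiment `trials` times."""
    count = 0
    for _ in range(trials):
        s = sum(rng.randint(0, 1) for _ in range(n))
        if abs(s / n - 0.5) >= eps:
            count += 1
    return count / trials

rng = random.Random(1)  # fixed seed for reproducibility
eps = 0.1
for n in (10, 100, 1000):
    freq = deviation_frequency(n, eps, 500, rng)
    bound = min(1.0, 0.25 / (n * eps ** 2))  # Chebychev bound, capped at 1
    print(f"n = {n}: observed {freq:.3f}, Chebychev bound {bound:.3f}")
```

The observed frequencies decay much faster than the bound suggests, which is consistent with the sharper estimates available under stronger moment assumptions.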



As an application in analysis, this leads to a constructive proof of a theorem 
of Weierstrass which states that polynomials are dense in the space C[0, 1] 
of all continuous functions on the interval [0, 1]. Unlike the abstract Weier- 
strass theorem, the construction with specific polynomials is constructive 
and gives explicit formulas. 



Figure. Approximation of a function $f(x)$ by Bernstein polynomials
$B_2, B_5, B_{10}, B_{20}, B_{30}$.

Theorem 2.6.2 (Weierstrass theorem). For every $f \in C[0, 1]$, the Bernstein
polynomials

$$B_n(x) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k}$$

converge uniformly to $f$. If $f(x) \ge 0$, then also $B_n(x) \ge 0$.



Proof. For $x \in [0, 1]$, let $X_n$ be a sequence of independent $\{0, 1\}$-valued
random variables with mean value $x$. In other words, we take the probability
space $(\{0, 1\}^{\mathbb{N}}, \mathcal{A}, P)$ defined by $P[\omega_n = 1] = x$. Since $P[S_n = k] =
\binom{n}{k} x^k (1 - x)^{n-k}$, we can write $B_n(x) = E[f(S_n/n)]$. We estimate with
$\|f\| = \max_{0 \le x \le 1} |f(x)|$

$$|B_n(x) - f(x)| = |E[f(S_n/n)] - f(x)| \le E[|f(S_n/n) - f(x)|]$$
$$\le 2 \|f\| \cdot P[|S_n/n - x| \ge \delta] + \sup_{|x-y| \le \delta} |f(x) - f(y)| \cdot P[|S_n/n - x| < \delta]$$
$$\le 2 \|f\| \cdot P[|S_n/n - x| \ge \delta] + \sup_{|x-y| \le \delta} |f(x) - f(y)| .$$

The second term in the last line is called the continuity module of $f$. It
converges to zero for $\delta \to 0$. By the Chebychev inequality (2.5.5) and the
proof of the weak law of large numbers, the first term can be estimated
from above by

$$2 \|f\| \frac{\mathrm{Var}[X_i]}{n \delta^2} ,$$

a bound which goes to zero for $n \to \infty$ because the variance satisfies
$\mathrm{Var}[X_i] = x(1 - x) \le 1/4$. □
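The Bernstein polynomials are explicit, so the uniform convergence can be watched numerically. This is a minimal sketch; the test function $f(x) = |x - 1/2|$, the grid size and the names `bernstein` and `sup_error` are assumptions made for illustration.

```python
import math

def bernstein(f, n, x):
    """B_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k) = E[f(S_n/n)]."""
    return sum(
        f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
        for k in range(n + 1)
    )

def sup_error(f, n, grid=200):
    """Maximum of |B_n(x) - f(x)| over an equally spaced grid on [0, 1]."""
    return max(abs(bernstein(f, n, i / grid) - f(i / grid)) for i in range(grid + 1))

f = lambda x: abs(x - 0.5)  # continuous, not differentiable at 1/2
for n in (2, 5, 10, 20, 30):
    print(f"n = {n}: max error on grid = {sup_error(f, n):.4f}")
```

For $f(x) = x^2$ one gets $B_n(x) = x^2 + x(1-x)/n$, where the correction term is exactly $\mathrm{Var}[S_n/n]$; this matches the variance bound used in the proof.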

In the first version of the weak law of large numbers, theorem (2.6.1), we
only assumed the random variables to be uncorrelated. Under the stronger
condition of independence and a stronger condition on the moments ($X^4 \in
\mathcal{L}^1$), the convergence can be accelerated:



Theorem 2.6.3 (Weak law of large numbers for independent $L^4$ random
variables). Assume $X_i \in \mathcal{L}^4$ have common expectation $E[X_i] = m$ and
satisfy $M = \sup_n \|X_n\|_4^4 < \infty$. If $X_i$ are independent, then $S_n/n \to m$ in
probability. Even $\sum_{n=1}^\infty P[|S_n/n - m| \ge \epsilon]$ converges for all $\epsilon > 0$.






Proof. We can assume without loss of generality that $m = 0$. Because the
$X_i$ are independent, we get

$$E[S_n^4] = \sum_{i_1, i_2, i_3, i_4 = 1}^n E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}] .$$

Again by independence, a summand $E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}]$ is zero if an index
$i = i_k$ occurs alone, it is $E[X_i^4]$ if all indices are the same and $E[X_i^2] E[X_j^2]$, if
there are two pairwise equal indices. Since by Jensen's inequality $E[X_i^2]^2 \le
E[X_i^4] \le M$ we get

$$E[S_n^4] \le nM + n(n-1)M .$$
Use now the Chebychev-Markov inequality (2.5.4) with $h(x) = x^4$ to get

$$P\left[\left|\frac{S_n}{n}\right| \ge \epsilon\right] \le \frac{E[(S_n/n)^4]}{\epsilon^4} \le M \frac{n + n^2}{\epsilon^4 n^4} \le 2M \frac{1}{\epsilon^4 n^2} .$$

□

We can weaken the moment assumption in order to deal with $L^1$ random
variables. Another assumption needs to become stronger:

Definition. A family $\{X_i\}_{i \in I}$ of random variables is called uniformly
integrable, if $\sup_{i \in I} E[|X_i| 1_{\{|X_i| \ge R\}}] \to 0$ for $R \to \infty$. A convenient notation
which we will use again in the future is $E[1_A X] = E[X; A]$ for $X \in \mathcal{L}^1$ and
$A \in \mathcal{A}$. Uniform integrability can then be written as $\sup_{i \in I} E[|X_i|; |X_i| \ge
R] \to 0$.



Theorem 2.6.4 (Weak law for uniformly integrable, independent $L^1$ random
variables). Assume $X_i \in \mathcal{L}^1$ are uniformly integrable. If $X_i$ are independent,
then $\frac{1}{n} \sum_{m=1}^n (X_m - E[X_m]) \to 0$ in $\mathcal{L}^1$ and therefore in probability.



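Proof. Without loss of generality, we can assume that $E[X_n] = 0$ for all
$n \in \mathbb{N}$, because otherwise $X_n$ can be replaced by $Y_n = X_n - E[X_n]$. In this
proof, $S_n = \frac{1}{n} \sum_{i=1}^n X_i$ denotes the average. Define
$f_R(t) = t 1_{[-R, R]}(t)$, the random variables

$$X_n^{(R)} = f_R(X_n) - E[f_R(X_n)] , \quad Y_n^{(R)} = X_n - X_n^{(R)}$$

as well as the random variables

$$S_n^{(R)} = \frac{1}{n} \sum_{i=1}^n X_i^{(R)} , \quad T_n^{(R)} = \frac{1}{n} \sum_{i=1}^n Y_i^{(R)} .$$

We estimate, using the Minkowski and Cauchy-Schwarz inequalities

$$\|S_n\|_1 \le \|S_n^{(R)}\|_1 + \|T_n^{(R)}\|_1$$
$$\le \|S_n^{(R)}\|_2 + 2 \sup_{1 \le l \le n} E[|X_l|; |X_l| \ge R]$$
$$\le \frac{R}{\sqrt{n}} + 2 \sup_{l \in \mathbb{N}} E[|X_l|; |X_l| \ge R] .$$

In the last step we have used the independence of the random variables and
$E[X_n^{(R)}] = 0$ to get

$$\|S_n^{(R)}\|_2^2 = E[(S_n^{(R)})^2] = \frac{1}{n^2} \sum_{i=1}^n E[(X_i^{(R)})^2] \le \frac{R^2}{n} .$$

The claim follows from the uniform integrability assumption
$\sup_{l \in \mathbb{N}} E[|X_l|; |X_l| \ge R] \to 0$ for $R \to \infty$. □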

A special case of the weak law of large numbers is the situation, where all 
the random variables are IID: 



Theorem 2.6.5 (Weak law of large numbers for IID $L^1$ random variables).
Assume $X_i \in \mathcal{L}^1$ are IID random variables with mean $m$. Then $S_n/n \to m$
in $\mathcal{L}^1$ and so in probability.



Proof. We show that a set of IID $L^1$ random variables is uniformly
integrable: given $X \in \mathcal{L}^1$, we have $K \cdot P[|X| > K] \le \|X\|_1$ so that $P[|X| >
K] \to 0$ for $K \to \infty$, and therefore $E[|X|; |X| > K] \to 0$.

Because the random variables $X_i$ are identically distributed, the probabilities
$P[|X_i| \ge R] = E[1_{\{|X_i| \ge R\}}]$ and the expectations $E[|X_i|; |X_i| \ge R]$ are
independent of $i$. Consequently any set
of IID random variables in $\mathcal{L}^1$ is also uniformly integrable. We can now use
theorem (2.6.4). □

Example. The random variable $X(x) = x^2$ on $[0, 1]$ has the expectation
$m = E[X] = \int_0^1 x^2 \, dx = 1/3$. For every $n$, we can form the average $S_n/n =
(x_1^2 + x_2^2 + \cdots + x_n^2)/n$. The weak law of large numbers tells us that
$P[|S_n/n - 1/3| \ge \epsilon] \to 0$ for $n \to \infty$. Geometrically, this means that for every $\epsilon > 0$,
the volume of the set of points in the $n$-dimensional cube for which the
distance $r(x_1, \dots, x_n) = \sqrt{x_1^2 + \cdots + x_n^2}$ to the origin satisfies
$\sqrt{n(1/3 - \epsilon)} \le r \le \sqrt{n(1/3 + \epsilon)}$ converges to 1 for $n \to \infty$. In colloquial
language, one could rephrase this as follows: as the number of dimensions goes to
infinity, most of the weight of an $n$-dimensional cube is concentrated near a
shell of radius $1/\sqrt{3} \approx 0.58$ times the length $\sqrt{n}$ of the longest diagonal in
the cube.
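The concentration near a shell can be observed by sampling points uniformly from the cube. The sketch below uses $E[x^2] = \int_0^1 x^2 \, dx = 1/3$ and counts how often $|r^2/n - 1/3| < \epsilon$; the function name `shell_fraction`, the tolerance and the sample sizes are hypothetical choices.

```python
import random

def shell_fraction(n, eps, trials, rng):
    """Fraction of uniform random points in [0,1]^n whose squared
    distance r^2 to the origin satisfies |r^2/n - 1/3| < eps."""
    hits = 0
    for _ in range(trials):
        r2 = sum(rng.random() ** 2 for _ in range(n))
        if abs(r2 / n - 1.0 / 3.0) < eps:
            hits += 1
    return hits / trials

rng = random.Random(5)  # fixed seed for reproducibility
for n in (10, 100, 1000):
    frac = shell_fraction(n, 0.05, 300, rng)
    print(f"n = {n}: fraction of points in the shell = {frac:.3f}")
```

The fraction approaches 1 as the dimension grows, in line with the weak law of large numbers.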
Exercise. Show that if $X, Y \in \mathcal{L}^1$ are independent random variables, then
$XY \in \mathcal{L}^1$. Find an example of two random variables $X, Y \in \mathcal{L}^1$ for which
$XY \notin \mathcal{L}^1$.



Exercise. a) Given a sequence $p_n \in [0, 1]$ and a sequence $X_n$ of independent random
variables taking values in $\{-1, 1\}$ such that $P[X_n = 1] = p_n$ and $P[X_n =
-1] = 1 - p_n$. Show that

$$\frac{1}{n} \sum_{k=1}^n (X_k - m_k) \to 0$$

in probability, where $m_k = 2 p_k - 1$.

b) We assume the same set up as in a) but this time, the sequence $p_n$ is
dependent on a parameter. Given a sequence $X_n$ of independent random
variables taking values in $\{-1, 1\}$ such that $P[X_n = 1] = p_n$ and $P[X_n =
-1] = 1 - p_n$ with $p_n = (1 + \cos[\theta + n\alpha])/2$, where $\theta$ is a parameter. Prove
that $\frac{1}{n} \sum_{k=1}^n X_k \to 0$ in $\mathcal{L}^1$ for almost all $\theta$. You can take for granted the fact
that $\frac{1}{n} \sum_{k=1}^n p_k \to 1/2$ for almost all real parameters $\theta \in [0, 2\pi]$.



Exercise. Prove that if $X_n \to X$ in $\mathcal{L}^1$, then there exists a subsequence
$Y_k = X_{n_k}$ satisfying $Y_k \to X$ almost everywhere.



Exercise. Given a sequence of random variables $X_n$. Show that $X_n$
converges to $X$ in probability if and only if

$$E\left[\frac{|X_n - X|}{1 + |X_n - X|}\right] \to 0$$

for $n \to \infty$.



Exercise. Give an example of a sequence of random variables X n which 
converges almost everywhere, but not completely. 



Exercise. Use the weak law of large numbers to verify that the volume of
an $n$-dimensional ball of radius 1 satisfies $V_n \to 0$ for $n \to \infty$. Estimate
how fast the volume goes to 0. (See example (2.6))
2.7 The probability distribution function



Definition. The law of a random variable $X$ is the probability measure $\mu$ on
$\mathbb{R}$ defined by $\mu(B) = P[X^{-1}(B)]$ for all $B$ in the Borel $\sigma$-algebra of $\mathbb{R}$. The
measure $\mu$ is also called the push-forward measure under the measurable
map $X : \Omega \to \mathbb{R}$.



Definition. The distribution function of a random variable $X$ is defined as

$$F_X(s) = \mu((-\infty, s]) = P[X \le s] .$$

The distribution function is sometimes also called cumulative density function
(CDF) but we do not use this name here in order not to confuse it
with the probability density function (PDF) $f_X(s) = F_X'(s)$ for continuous
random variables.



Remark. The distribution function $F$ is very useful. For example, if $X$ is a
continuous random variable with distribution function $F$, then $Y = F(X)$
has the uniform distribution on $[0, 1]$. We can reverse this. If we want to
produce random variables with a distribution function $F$, just take a random
variable $Y$ with uniform distribution on $[0, 1]$ and define $X = F^{-1}(Y)$. This
random variable has the distribution function $F$ because $P[X \in [a, b]] =
P[F^{-1}(Y) \in [a, b]] = P[Y \in F([a, b])] = P[Y \in [F(a), F(b)]] = F(b) - F(a)$.
We see that we need only a random number generator which
produces uniformly distributed random variables in $[0, 1]$ to produce random
variables with a given continuous distribution.
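The recipe $X = F^{-1}(Y)$ is easy to try out. The following sketch assumes the exponential distribution $F(s) = 1 - e^{-\lambda s}$ as a concrete target (this distribution is not discussed in the text above), for which $F^{-1}(y) = -\log(1 - y)/\lambda$. The function name `sample_exponential` and the parameter values are illustrative.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse transform sampling: draw Y uniform on [0, 1) and return
    F^{-1}(Y) for F(s) = 1 - exp(-rate * s)."""
    y = rng.random()
    return -math.log(1.0 - y) / rate

rng = random.Random(4)  # fixed seed for reproducibility
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(20000)]

# Compare the empirical distribution function at s with F(s).
s = 0.5
empirical = sum(1 for x in samples if x <= s) / len(samples)
exact = 1.0 - math.exp(-rate * s)
mean = sum(samples) / len(samples)
print(f"empirical F({s}) = {empirical:.3f}, exact F({s}) = {exact:.3f}")
print(f"sample mean = {mean:.3f}, exact mean = {1.0 / rate:.3f}")
```

Only the uniform generator `rng.random()` is used as a source of randomness, which is the point of the remark.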



Definition. A set of random variables is called identically distributed, if 

each random variable in the set has the same distribution function. It is 
called independent and identically distributed if the random variables are 
independent and identically distributed. A common abbreviation for inde- 
pendent identically distributed random variables is IID. 



Example. Let $\Omega = [0, 1]$ be the unit interval with the Lebesgue measure $\mu$
and let $m$ be a positive integer. Define the random variable $X(x) = x^m$. One calls
its distribution a power distribution. It is in $\mathcal{L}^1$ and has the expectation
$E[X] = 1/(m + 1)$. The distribution function of $X$ is $F_X(s) = s^{1/m}$ on
$[0, 1]$ and $F_X(s) = 0$ for $s < 0$ and $F_X(s) = 1$ for $s \ge 1$. The random
variable is continuous in the sense that it has a probability density function
$f_X(s) = F_X'(s) = s^{1/m - 1}/m$ so that $F_X(s) = \int_{-\infty}^s f_X(t) \, dt$.




Given two IID random variables $X, Y$ with the $m$'th power distribution as
above, we can look at the random variables $V = X + Y$, $W = X - Y$. One can
realize $V$ and $W$ on the unit square $\Omega = [0, 1] \times [0, 1]$ by $V(x, y) = x^m + y^m$
and $W(x, y) = x^m - y^m$. The distribution functions $F_V(s) = P[V \le s]$ and
$F_W(s) = P[W \le s]$ are the areas of the sets $A(s) = \{ (x, y) \mid x^m + y^m \le s \}$
and $B(s) = \{ (x, y) \mid x^m - y^m \le s \}$.





Figure. $F_V(s)$ is the area of the
set $A(s)$, shown here in the case
$m = 4$.



Figure. $F_W(s)$ is the area of the
set $B(s)$, shown here in the case
$m = 4$.



We will later see how to compute the distribution function of a sum of
independent random variables algebraically from the probability distribution
function $F_X$. From the area interpretation, we see in this case

$$F_V(s) = \begin{cases} \int_0^{s^{1/m}} (s - x^m)^{1/m} \, dx , & s \in [0, 1] , \\ (s-1)^{1/m} + \int_{(s-1)^{1/m}}^1 (s - x^m)^{1/m} \, dx , & s \in [1, 2] \end{cases}$$

and

$$F_W(s) = \begin{cases} \int_0^{(s+1)^{1/m}} 1 - (x^m - s)^{1/m} \, dx , & s \in [-1, 0] , \\ s^{1/m} + \int_{s^{1/m}}^1 1 - (x^m - s)^{1/m} \, dx , & s \in [0, 1] . \end{cases}$$




Figure. The function $F_V(s)$ with
density (dashed) $f_V(s)$ of the sum
of two power distributed random
variables with $m = 2$.




Figure. The function $F_W(s)$ with
density (dashed) $f_W(s)$ of the
difference of two power distributed
random variables with $m = 2$.



Exercise. a) Verify that for $\theta > 0$ the Maxwell distribution

$$f(x) = \frac{4}{\sqrt{\pi}} \theta^{3/2} x^2 e^{-\theta x^2}$$

is a probability distribution on $\mathbb{R}^+ = [0, \infty)$. This distribution can model
the speed distribution of molecules in thermal equilibrium.
b) Verify that for $\theta > 0$ the Rayleigh distribution

$$f(x) = 2 \theta x e^{-\theta x^2}$$

is a probability distribution on $\mathbb{R}^+ = [0, \infty)$. This distribution can model
the speed distribution $\sqrt{X^2 + Y^2}$ of a two dimensional wind velocity $(X, Y)$,
where both $X, Y$ are normal random variables.



2.8 Convergence of random variables 



In order to formulate the strong law of large numbers, we need some other
notions of convergence.

Definition. A sequence of random variables $X_n$ converges in probability to
a random variable $X$, if

$$P[|X_n - X| \ge \epsilon] \to 0$$

for all $\epsilon > 0$.

Definition. A sequence of random variables $X_n$ converges almost everywhere
or almost surely to a random variable $X$, if $P[X_n \to X] = 1$.

Definition. A sequence of $\mathcal{L}^p$ random variables $X_n$ converges in $\mathcal{L}^p$ to a
random variable $X$, if

$$\|X_n - X\|_p \to 0$$

for $n \to \infty$.

Definition. A sequence of random variables $X_n$ converges fast in probability,
or completely if

$$\sum_n P[|X_n - X| \ge \epsilon] < \infty$$

for all $\epsilon > 0$.

We have so four notions of convergence of random variables $X_n \to X$, if
the random variables are defined on the same probability space $(\Omega, \mathcal{A}, P)$.
We will later see the two equivalent but weaker notions convergence in
distribution and weak convergence, which do not necessarily assume $X_n$ and
$X$ to be defined on the same probability space. Let's nevertheless add these
two definitions also here. We will see later, in theorem (2.13.2), that the
following definitions are equivalent:

Definition. A sequence of random variables $X_n$ converges in distribution,
if $F_{X_n}(s) \to F_X(s)$ for all points $s$, where $F_X$ is continuous.

Example. Let $\Omega_n = \{1, 2, \dots, n\}$ with the uniform distribution $P[\{k\}] =
1/n$ and $X_n$ the random variable $X_n(x) = x/n$. Let $X(x) = x$ on the probability
space $[0, 1]$ with probability $P[[a, b)] = b - a$. The random variables
$X_n$ and $X$ are defined on different probability spaces but $X_n$ converges
to $X$ in distribution for $n \to \infty$.

Definition. A sequence of random variables $X_n$ converges in law to a random
variable $X$, if the laws $\mu_n$ of $X_n$ converge weakly to the law $\mu$ of
$X$.

Remark. In other words, $X_n$ converges in law to $X$ if for every continuous
function $f$ on $\mathbb{R}$ of compact support, one has

$$\int f(x) \, d\mu_n(x) \to \int f(x) \, d\mu(x) .$$



Proposition 2.8.1. The next figure shows the relations between the different
convergence types.

0) In distribution = in law: $F_{X_n}(s) \to F_X(s)$, $F_X$ continuous at $s$.

1) In probability: $P[|X_n - X| \ge \epsilon] \to 0$, $\forall \epsilon > 0$.

2) Almost everywhere: $P[X_n \to X] = 1$.

3) In $\mathcal{L}^p$: $\|X_n - X\|_p \to 0$.

4) Complete: $\sum_n P[|X_n - X| \ge \epsilon] < \infty$, $\forall \epsilon > 0$.

The implications are 4) $\Rightarrow$ 2) $\Rightarrow$ 1) $\Rightarrow$ 0) and 3) $\Rightarrow$ 1).



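Proof. 2) $\Rightarrow$ 1): Since

$$\{X_n \to X\} = \bigcap_k \bigcup_m \bigcap_{n \ge m} \{ |X_n - X| \le 1/k \} ,$$

"almost everywhere convergence" is equivalent to

$$1 = P\left[\bigcup_m \bigcap_{n \ge m} \{ |X_n - X| \le \tfrac{1}{k} \}\right] = \lim_{m \to \infty} P\left[\bigcap_{n \ge m} \{ |X_n - X| \le \tfrac{1}{k} \}\right]$$

for all $k$ and so

$$0 = \lim_{m \to \infty} P\left[\bigcup_{n \ge m} \{ |X_n - X| > \tfrac{1}{k} \}\right]$$

for all $k$. Therefore

$$P[|X_m - X| \ge \epsilon] \le P\left[\bigcup_{n \ge m} \{ |X_n - X| \ge \epsilon \}\right] \to 0$$

for all $\epsilon > 0$.

4) $\Rightarrow$ 2): The first Borel-Cantelli lemma implies that for all $\epsilon > 0$

$$P[|X_n - X| \ge \epsilon, \text{ infinitely often}] = 0 .$$

We get so for $\epsilon_k \to 0$

$$P\left[\bigcup_k \{ |X_n - X| \ge \epsilon_k, \text{ infinitely often} \}\right] \le \sum_k P[|X_n - X| \ge \epsilon_k, \text{ infinitely often}] = 0$$

from which we obtain $P[X_n \to X] = 1$.

3) $\Rightarrow$ 1): Use the Chebychev-Markov inequality (2.5.4), to get

$$P[|X_n - X| \ge \epsilon] \le \frac{E[|X_n - X|^p]}{\epsilon^p} .$$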

□ 

Example. Here is an example of convergence in probability but not almost
everywhere convergence. Let $([0, 1], \mathcal{A}, P)$ be the Lebesgue measure space,
where $\mathcal{A}$ is the Borel $\sigma$-algebra on $[0, 1]$. Define the random variables

$$X_{n,k} = 1_{[k 2^{-n}, (k+1) 2^{-n}]} , \quad n = 1, 2, \dots , \quad k = 0, \dots, 2^n - 1 .$$

By lexicographical ordering $X_1 = X_{1,0}, X_2 = X_{1,1}, X_3 = X_{2,0}, X_4 =
X_{2,1}, \dots$ we get a sequence $X_n$ satisfying

$$\liminf_{n \to \infty} X_n(\omega) = 0 , \quad \limsup_{n \to \infty} X_n(\omega) = 1$$

but $P[|X_{n,k}| \ge \epsilon] \le 2^{-n}$.
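The behavior of this sequence at a fixed point $\omega$ can be listed explicitly. The sketch below assumes the lexicographic enumeration $X_1 = X_{1,0}, X_2 = X_{1,1}, X_3 = X_{2,0}, \dots$ with $k$ running from 0 to $2^n - 1$; the function names `typewriter` and `X` and the point $\omega = 0.3$ are hypothetical choices for illustration.

```python
def typewriter(j):
    """Map the position j = 1, 2, 3, ... in the lexicographic ordering
    to the pair (n, k) with n >= 1 and 0 <= k <= 2^n - 1."""
    n = 1
    while j > 2 ** n:
        j -= 2 ** n
        n += 1
    return n, j - 1

def X(j, omega):
    """Value at omega of the j-th indicator function 1_[k 2^-n, (k+1) 2^-n]."""
    n, k = typewriter(j)
    return 1 if k * 2 ** -n <= omega <= (k + 1) * 2 ** -n else 0

omega = 0.3
# Levels n = 1, ..., 5 contain 2 + 4 + 8 + 16 + 32 = 62 functions.
values = [X(j, omega) for j in range(1, 63)]
print(values)
print("number of ones among the first 62 terms:", sum(values))
```

On every level $n$ the point $\omega$ lies in some interval, so the value 1 keeps recurring while zeros dominate; the sequence $X_j(\omega)$ therefore does not converge, although $P[X_j \ne 0] = 2^{-n} \to 0$.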

Example. And here is an example of almost everywhere but not $\mathcal{L}^p$
convergence: the random variables

$$X_n = 2^n 1_{[0, 2^{-n}]}$$

on the probability space $([0, 1], \mathcal{A}, P)$ converge almost everywhere to the
constant random variable $X = 0$ but not in $\mathcal{L}^p$ because $\|X_n\|_p = 2^{n(p-1)/p}$.
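The norm in this example follows from a one-line computation, written out here as a worked equation (the integration variable $x$ is the coordinate on $[0, 1]$):

```latex
\|X_n\|_p^p = E[|X_n|^p] = \int_0^{2^{-n}} 2^{np} \, dx = 2^{np} \cdot 2^{-n} = 2^{n(p-1)} ,
\qquad
\|X_n\|_p = 2^{n(p-1)/p} \ge 1 \quad \text{for all } p \ge 1 .
```

Since the norms stay at least 1, the sequence cannot converge to 0 in $\mathcal{L}^p$, even though $X_n(x) = 0$ for every fixed $x > 0$ as soon as $2^{-n} < x$.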

With more assumptions other implications can hold. We give two examples. 



Proposition 2.8.2. Given a sequence $X_n \in \mathcal{L}^\infty$ with $\|X_n\|_\infty \le K$ for all $n$,
then $X_n \to X$ in probability if and only if $X_n \to X$ in $\mathcal{L}^1$.



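Proof. (i) $P[|X| \le K] = 1$. Proof. For $k \in \mathbb{N}$,

$$P[|X| > K + \tfrac{1}{k}] \le P[|X - X_n| > \tfrac{1}{k}] \to 0 , \quad n \to \infty$$
so that $P[|X| > K + \tfrac{1}{k}] = 0$. Therefore

$$P[|X| > K] = P\left[\bigcup_k \{ |X| > K + \tfrac{1}{k} \}\right] = 0 .$$

(ii) Given $\epsilon > 0$. Choose $m$ such that for all $n > m$

$$P[|X_n - X| > \tfrac{\epsilon}{3}] < \frac{\epsilon}{3K} .$$

Then, using (i) and the notation $E[X; A] = E[X \cdot 1_A]$

$$E[|X_n - X|] = E[|X_n - X|; |X_n - X| > \tfrac{\epsilon}{3}] + E[|X_n - X|; |X_n - X| \le \tfrac{\epsilon}{3}]$$
$$\le 2K \, P[|X_n - X| > \tfrac{\epsilon}{3}] + \tfrac{\epsilon}{3} \le \epsilon .$$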

□ 

Definition. Recall that a family $\mathcal{C} \subset \mathcal{L}^1$ of random variables is called
uniformly integrable, if

$$\lim_{R \to \infty} \sup_{X \in \mathcal{C}} E[|X| 1_{\{|X| > R\}}] = \lim_{R \to \infty} \sup_{X \in \mathcal{C}} E[|X|; |X| > R] = 0 .$$

The next lemma has already been used in the proof of the
weak law of large numbers for IID random variables.



Lemma 2.8.3. Given $X \in \mathcal{L}^1$ and $\epsilon > 0$. Then, there exists $K \ge 0$ with
$E[|X|; |X| > K] < \epsilon$.



Proof. Assume we are given $\epsilon > 0$. If $X \in \mathcal{L}^1$, we can find $\delta > 0$ such that if
$P[A] < \delta$, then $E[|X|; A] < \epsilon$. Since $K P[|X| > K] \le E[|X|]$, we can choose
$K$ such that $P[|X| > K] < \delta$. Therefore $E[|X|; |X| > K] < \epsilon$. □

The next proposition gives a necessary and sufficient condition for C 1 con- 
vergence. 



Proposition 2.8.4. Given a sequence of random variables $X_n \in \mathcal{L}^1$ and $X \in
\mathcal{L}^1$. The following are equivalent:

a) $X_n$ converges in probability to $X$ and $\{X_n\}_{n \in \mathbb{N}}$ is uniformly integrable.

b) $X_n$ converges in $\mathcal{L}^1$ to $X$.



Proof. a) $\Rightarrow$ b). For any random variable $X$ and $K \ge 0$ define the bounded
variable

$$X^{(K)} = X \cdot 1_{\{-K \le X \le K\}} + K \cdot 1_{\{X > K\}} - K \cdot 1_{\{X < -K\}} .$$

By the uniform integrability condition for the $X_n$ and the above lemma (2.8.3) applied
to $X$, we can choose $K$ such that for all $n$,

$$E[|X_n^{(K)} - X_n|] < \frac{\epsilon}{3} , \quad E[|X^{(K)} - X|] < \frac{\epsilon}{3} .$$

Since $|X_n^{(K)} - X^{(K)}| \le |X_n - X|$, we have $X_n^{(K)} \to X^{(K)}$ in probability.
By the last proposition (2.8.2), we know $X_n^{(K)} \to X^{(K)}$ in $\mathcal{L}^1$ so that for
$n > m$, $E[|X_n^{(K)} - X^{(K)}|] \le \epsilon/3$. Therefore, for $n > m$ also

$$E[|X_n - X|] \le E[|X_n - X_n^{(K)}|] + E[|X_n^{(K)} - X^{(K)}|] + E[|X^{(K)} - X|] \le \epsilon .$$

b) $\Rightarrow$ a). We have seen already that $X_n \to X$ in probability if $\|X_n - X\|_1 \to
0$. We have to show that $X_n \to X$ in $\mathcal{L}^1$ implies that $X_n$ is uniformly
integrable.

Given $\epsilon > 0$. There exists $m$ such that $E[|X_n - X|] < \epsilon/2$ for $n > m$. By
the absolute continuity property, we can choose $\delta > 0$ such that $P[A] < \delta$
implies

$$E[|X_n|; A] < \epsilon , \; 1 \le n \le m , \quad E[|X|; A] < \epsilon/2 .$$

Because $X_n$ is bounded in $\mathcal{L}^1$, we can choose $K$ such that $K^{-1} \sup_n E[|X_n|] <
\delta$ which implies $P[|X_n| > K] < \delta$. For $n > m$, we have therefore, using the
notation $E[X; A] = E[X \cdot 1_A]$

$$E[|X_n|; |X_n| > K] \le E[|X|; |X_n| > K] + E[|X - X_n|] < \epsilon .$$

□ 



Exercise. a) $P[\sup_{k \ge n} |X_k - X| > \epsilon] \to 0$ for $n \to \infty$ and all $\epsilon > 0$ if and
only if $X_n \to X$ almost everywhere.

b) A sequence $X_n$ converges almost surely if and only if

$$\lim_{n \to \infty} P[\sup_{k \ge 1} |X_{n+k} - X_n| > \epsilon] = 0$$

for all $\epsilon > 0$.



2.9 The strong law of large numbers 

The weak law of large numbers makes a statement about the stochastic
convergence of sums

$$\frac{S_n}{n} = \frac{X_1 + \cdots + X_n}{n}$$

of random variables $X_n$. The strong laws of large numbers make analogous
statements about almost everywhere convergence.



The first version of the strong law does not assume the random variables to
have the same distribution. They are assumed to have the same expectation
and have to be bounded in $\mathcal{L}^4$.



Theorem 2.9.1 (Strong law for independent $L^4$-random variables). Assume
$X_n$ are independent random variables in $\mathcal{L}^4$ with common expectation
$E[X_n] = m$ and for which $M = \sup_n \|X_n\|_4^4 < \infty$. Then $S_n/n \to m$ almost
everywhere.



Proof. In the proof of theorem (2.6.3), we derived

$$P\left[\left|\frac{S_n}{n} - m\right| \ge \epsilon\right] \le 2M \frac{1}{\epsilon^4 n^2} .$$

This means that $S_n/n \to m$ converges completely. By proposition (2.8.1) we
have almost everywhere convergence. □

Here is an application of the strong law: 

Definition. A real number $x \in [0, 1]$ is called normal to the base 10, if its
decimal expansion $x = 0.x_1 x_2 \dots$ has the property that each digit appears
with the same frequency 1/10.



Corollary 2.9.2. (Normality of numbers) On the probability space 
([0, 1],B, Q = dx), Lebesgue almost all numbers x are normal. 



Proof. Define the random variables $X_n(x) = x_n$, where $x_n$ is the $n$'th
decimal digit. We have only to verify that $X_n$ are IID random variables. The
strong law of large numbers, applied for each digit $d$ to the indicator
variables $1_{\{X_n = d\}}$, will assure that almost all $x$ are normal. Let $\Omega =
\{0, 1, \dots, 9\}^{\mathbb{N}}$ be the space of all infinite sequences $\omega = (\omega_1, \omega_2, \omega_3, \dots)$.
Define on $\Omega$ the product $\sigma$-algebra $\mathcal{A}$ and the product probability measure
$P$. Define the measurable map $S(\omega) = \sum_{k=1}^\infty \omega_k / 10^k = x$ from $\Omega$ to $[0, 1]$.
It produces for every sequence in $\Omega$ a real number $x \in [0, 1]$. The integers
$\omega_k$ are just the decimal digits of $x$. The map $S$ is measure preserving and
can be inverted on a set of measure 1 because almost all real numbers have
a unique decimal expansion.

Because $X_n(x) = X_n(S(\omega)) = Y_n(\omega) = \omega_n$ if $S(\omega) = x$, we see that $X_n$
are the same random variables as $Y_n$. The latter are by construction IID
with uniform distribution on $\{0, 1, \dots, 9\}$. □

Remark. While almost all numbers are normal, it is difficult to decide
normality for specific real numbers. One does not know for example whether
$\pi - 3 = 0.1415926\dots$ or $\sqrt{2} - 1 = 0.41421\dots$ are normal.
The strong law for IID random variables was first proven by Kolmogorov
in 1930. Only much later, in 1981, was it observed that the weaker
notion of pairwise independence is sufficient [25]:



Theorem 2.9.3 (Strong law for pairwise independent $L^1$ random variables).
Assume $X_n \in \mathcal{L}^1$ are pairwise independent and identically distributed random
variables. Then $S_n/n \to E[X_1]$ almost everywhere.



Proof. We can assume without loss of generality that $X_n \geq 0$ (because we
can split $X_n = X_n^+ - X_n^-$ into its positive part $X_n^+ = X_n \vee 0 = \max(X_n, 0)$
and negative part $X_n^- = (-X_n) \vee 0 = \max(-X_n, 0)$. Knowing the result for
$X_n^{\pm}$ implies the result for $X_n$.)

Define $f_R(t) = t \cdot 1_{[-R,R]}(t)$, the random variables $X_n^{(R)} = f_R(X_n)$ and
$Y_n = X_n^{(n)}$, as well as

$$ S_n = \frac{1}{n} \sum_{i=1}^{n} X_i , \qquad T_n = \frac{1}{n} \sum_{i=1}^{n} Y_i . $$

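(i) It is enough to show that $T_n - E[T_n] \to 0$.

Proof. Since $E[Y_n] \to E[X_1] = m$, we get $E[T_n] \to m$. Because

$$ \sum_{n \geq 1} P[Y_n \neq X_n] \leq \sum_{n \geq 1} P[X_n \geq n] = \sum_{n \geq 1} P[X_1 \geq n] $$
$$ = \sum_{n \geq 1} \sum_{k \geq n} P[X_1 \in [k, k+1)] $$
$$ = \sum_{k \geq 1} k \cdot P[X_1 \in [k, k+1)] \leq E[X_1] < \infty , $$

we get by the first Borel-Cantelli lemma that $P[Y_n \neq X_n, \text{infinitely often}] = 0$.
This means $T_n - S_n \to 0$ almost everywhere, so that $S_n \to m$ almost
everywhere follows once $T_n \to m$ almost everywhere is shown.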

(ii) Fix a real number $\alpha > 1$ and define an exponentially growing
subsequence $k_n = [\alpha^n]$ which is the integer part of $\alpha^n$. Denote by $\mu$ the
law of the random variables $X_n$. For every $\epsilon > 0$, we get using Chebychev
inequality (2.5.5), pairwise independence for $k_n = [\alpha^n]$ and constants $C$
which can vary from line to line:

$$ \sum_{n=1}^{\infty} P[\,|T_{k_n} - E[T_{k_n}]| \geq \epsilon\,] \leq \sum_{n=1}^{\infty} \frac{\mathrm{Var}[T_{k_n}]}{\epsilon^2} $$
$$ = \sum_{n=1}^{\infty} \frac{1}{\epsilon^2 k_n^2} \sum_{m=1}^{k_n} \mathrm{Var}[Y_m] $$
$$ = \frac{1}{\epsilon^2} \sum_{m=1}^{\infty} \mathrm{Var}[Y_m] \sum_{n: k_n \geq m} \frac{1}{k_n^2} $$
$$ \leq_{(1)} \frac{1}{\epsilon^2} \sum_{m=1}^{\infty} \mathrm{Var}[Y_m] \frac{C}{m^2} $$
$$ \leq C \sum_{m=1}^{\infty} \frac{1}{m^2} E[Y_m^2] . $$

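In (1) we used that with $k_n = [\alpha^n]$ one has $\sum_{n: k_n \geq m} k_n^{-2} \leq C \cdot m^{-2}$. In the
last step we used that $\mathrm{Var}[Y_m] = E[Y_m^2] - E[Y_m]^2 \leq E[Y_m^2]$.
Let's take some breath and continue where we have just left off:

$$ \sum_{n=1}^{\infty} P[\,|T_{k_n} - E[T_{k_n}]| \geq \epsilon\,] \leq C \sum_{m=1}^{\infty} \frac{1}{m^2} E[Y_m^2] $$
$$ \leq C \sum_{m=1}^{\infty} \frac{1}{m^2} \sum_{l=0}^{m-1} \int_l^{l+1} x^2 \, d\mu(x) $$
$$ = C \sum_{l=0}^{\infty} \sum_{m=l+1}^{\infty} \frac{1}{m^2} \int_l^{l+1} x^2 \, d\mu(x) $$
$$ \leq C \sum_{l=0}^{\infty} \sum_{m=l+1}^{\infty} \frac{(l+1)}{m^2} \int_l^{l+1} x \, d\mu(x) $$
$$ \leq_{(2)} C \sum_{l=0}^{\infty} \int_l^{l+1} x \, d\mu(x) $$
$$ \leq C \cdot E[X_1] < \infty . $$

In (2) we used that $\sum_{m=l+1}^{\infty} m^{-2} \leq C \cdot (l+1)^{-1}$.

We have now proved complete (= fast stochastic) convergence. This implies
the almost everywhere convergence of $T_{k_n} - E[T_{k_n}] \to 0$.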

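(iii) So far, the convergence has only been verified along a subsequence $k_n$.
Because we assumed $X_n \geq 0$, the sequence $U_n = \sum_{i=1}^{n} Y_i = n T_n$ is
monotonically increasing. For $n \in [k_m, k_{m+1}]$, we get therefore

$$ \frac{k_m}{k_{m+1}} T_{k_m} = \frac{U_{k_m}}{k_{m+1}} \leq \frac{U_n}{n} = T_n \leq \frac{U_{k_{m+1}}}{k_m} = \frac{k_{m+1}}{k_m} T_{k_{m+1}} $$

and from $\lim_{m \to \infty} T_{k_m} = E[X_1]$ almost everywhere, the statement

$$ \frac{1}{\alpha} E[X_1] \leq \liminf_n T_n \leq \limsup_n T_n \leq \alpha E[X_1] $$

follows. Since $\alpha > 1$ was arbitrary, $T_n \to E[X_1]$ almost everywhere. □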

Remark. The strong law of large numbers can be interpreted as a statement
about the growth of the sequence $\sum_{k=1}^{n} X_k$. For $E[X_1] = 0$, the convergence
$\frac{1}{n} \sum_{k=1}^{n} X_k \to 0$ means that for all $\epsilon > 0$ there exists $m$ such that for $n > m$

$$ \Big| \sum_{k=1}^{n} X_k \Big| \leq \epsilon n . $$

This means that the trajectory $\sum_{k=1}^{n} X_k$ is finally contained in any
arbitrarily small cone. In other words, it grows slower than linear. The exact
description for the growth of $\sum_{k=1}^{n} X_k$ is given by the law of the iterated
logarithm of Khinchin which says that a sequence of IID random variables
$X_n$ with $E[X_n] = m$ and $\sigma(X_n) = \sigma \neq 0$ satisfies

$$ \limsup_{n \to \infty} \frac{\sum_{k=1}^{n} (X_k - m)}{\Lambda_n} = +1 , \qquad \liminf_{n \to \infty} \frac{\sum_{k=1}^{n} (X_k - m)}{\Lambda_n} = -1 , $$

with $\Lambda_n = \sqrt{2 \sigma^2 n \log \log n}$. We will prove this theorem later in a special
case in theorem (2.18.2).

Remark. The IID assumption on the random variables can not be weakened
without further restrictions. Take for example a sequence $X_n$ of random
variables satisfying $P[X_n = \pm 2^n] = 1/2$. Then $E[X_n] = 0$ but even $S_n/n$
does not converge.



Exercise. Let $X_i$ be IID random variables in $\mathcal{L}^2$. Define $Y_k = \frac{1}{k} \sum_{i=1}^{k} X_i$.
What can you say about $S_n = \frac{1}{n} \sum_{k=1}^{n} Y_k$?



2.10 The Birkhoff ergodic theorem 

In this section we fix a probability space $(\Omega, \mathcal{A}, P)$ and consider sequences
of random variables $X_n$ which are defined dynamically by a map $T$ on $\Omega$ by

$$ X_n(\omega) = X(T^n(\omega)) , $$

where $T^n(\omega) = T(T(\dots T(\omega)))$ is the $n$'th iterate of $\omega$. This can include
as a special case the situation that the random variables are independent,
but it can be much more general. Like martingale theory, covered
later in these notes, ergodic theory is not only a generalization of classical
probability theory, it is a considerable extension of it, both in language and
in scope.

Definition. A measurable map $T : \Omega \to \Omega$ from the probability space onto
itself is called measure preserving, if $P[T^{-1}(A)] = P[A]$ for all $A \in \mathcal{A}$. The
map $T$ is called ergodic if $T(A) = A$ implies $P[A] = 0$ or $P[A] = 1$. A
measure preserving map $T$ is called invertible, if there exists a measurable,
measure preserving inverse $T^{-1}$ of $T$. An invertible measure preserving map
$T$ is also called an automorphism of the probability space.

Example. Let $\Omega = \{ |z| = 1 \} \subset \mathbb{C}$ be the unit circle in the complex plane
with the measure $P[\mathrm{Arg}(z) \in [a,b]] = (b-a)/(2\pi)$ for $0 < a < b < 2\pi$
and the Borel $\sigma$-algebra $\mathcal{B}$. If $w = e^{2\pi i \alpha}$ is a complex number of length 1,
then the rotation $T(z) = wz$ defines a measure preserving transformation
on $(\Omega, \mathcal{B}, P)$. It is invertible with inverse $T^{-1}(z) = z/w$.

Example. The transformation $T(z) = z^2$ on the same probability space as
in the previous example is also measure preserving. Note that $P[T(A)] = 2P[A]$
but $P[T^{-1}(A)] = P[A]$ for all $A \in \mathcal{B}$. The map is measure preserving
but it is not invertible.

Remark. $T$ is ergodic if and only if for any $X \in \mathcal{L}^1$ the condition $X(T) = X$
implies that $X$ is constant almost everywhere.

Example. The rotation on the circle is ergodic if $\alpha$ is irrational. Proof:
with $z = e^{2\pi i x}$ one can write a random variable $X$ on $\Omega$ as a Fourier series
$f(z) = \sum_{n=-\infty}^{\infty} a_n z^n$ which is the sum $f_0 + f_+ + f_-$, where $f_+ = \sum_{n=1}^{\infty} a_n z^n$
is analytic in $|z| < 1$ and $f_- = \sum_{n=1}^{\infty} a_{-n} z^{-n}$ is analytic in $|z| > 1$ and $f_0$ is
constant. By doing the same decomposition for $f(T(z)) = \sum_{n=-\infty}^{\infty} a_n w^n z^n$,
we see that $f_+ = \sum_{n=1}^{\infty} a_n z^n = \sum_{n=1}^{\infty} a_n w^n z^n$. But these are the Taylor
expansions of $f_+ = f_+(T)$ and so $a_n = a_n w^n$. Because $w^n \neq 1$ for irrational
$\alpha$, we deduce $a_n = 0$ for $n \geq 1$. Similarly, one derives $a_n = 0$ for $n \leq -1$.
Therefore $f(z) = a_0$ is constant.

Example. Also the non-invertible squaring transformation $T(z) = z^2$ on
the circle is ergodic as a Fourier argument shows again: $T$ preserves again
the decomposition of $f$ into three analytic functions $f = f_- + f_0 + f_+$
so that $f(T(z)) = \sum_{n=-\infty}^{\infty} a_n z^{2n} = \sum_{n=-\infty}^{\infty} a_n z^n$ implies
$\sum_{n=1}^{\infty} a_n z^{2n} = \sum_{n=1}^{\infty} a_n z^n$. Comparing Taylor coefficients of this identity for
analytic functions shows $a_n = 0$ for odd $n$ because the left hand side has zero
Taylor coefficients for odd powers of $z$. But because for even $n = 2^l k$ with odd
$k$, we have $a_n = a_{2^l k} = a_{2^{l-1} k} = \dots = a_k = 0$, all coefficients $a_k = 0$ for
$k \geq 1$. Similarly, one sees $a_k = 0$ for $k \leq -1$.

Definition. Given a random variable $X \in \mathcal{L}$ and a measure preserving
transformation $T$, one obtains a sequence of random variables $X_n = X(T^n) \in \mathcal{L}$
by $X(T^n)(\omega) = X(T^n \omega)$. They all have the same distribution. Define
$S_0 = 0$ and $S_n = \sum_{k=0}^{n-1} X(T^k)$.

Theorem 2.10.1 (Maximal ergodic theorem of Hopf). Given $X \in \mathcal{L}^1$ and
a measure preserving transformation $T$, the event $A = \{ \sup_n S_n > 0 \}$
satisfies

$$ E[X; A] = E[1_A X] \geq 0 . $$



Proof. Define $Z_n = \max_{0 \leq k \leq n} S_k$ and the sets $A_n = \{ Z_n > 0 \} \subset A_{n+1}$.
Then $A = \bigcup_n A_n$. Clearly $Z_n \in \mathcal{L}^1$. For $0 \leq k \leq n$, we have $Z_n \geq S_k$ and
so $Z_n(T) \geq S_k(T)$ and hence

$$ Z_n(T) + X \geq S_{k+1} . $$

By taking the maxima on both sides over $0 \leq k \leq n$, we get

$$ Z_n(T) + X \geq \max_{1 \leq k \leq n+1} S_k . $$

On $A_n = \{ Z_n > 0 \}$, we can extend this to
$Z_n(T) + X \geq \max_{1 \leq k \leq n+1} S_k = \max_{0 \leq k \leq n+1} S_k = Z_{n+1} \geq Z_n$ so that on $A_n$

$$ X \geq Z_n - Z_n(T) . $$

Integration over the set $A_n$ gives

$$ E[X; A_n] \geq E[Z_n; A_n] - E[Z_n(T); A_n] . $$

Using (1) this inequality, the fact (2) that $Z_n = 0$ on $\Omega \setminus A_n$, the (3)
inequality $Z_n(T) \geq S_0(T) = 0$ and finally that $T$ is measure preserving (4),
leads to

$$ E[X; A_n] \geq_{(1)} E[Z_n; A_n] - E[Z_n(T); A_n] $$
$$ =_{(2)} E[Z_n] - E[Z_n(T); A_n] $$
$$ \geq_{(3)} E[Z_n - Z_n(T)] =_{(4)} 0 $$

for every $n$ and so to $E[X; A] \geq 0$. □

A special case is if $A$ is the entire set:

Corollary 2.10.2. Given $X \in \mathcal{L}^1$ and a measure preserving transformation
$T$. If $\sup_n S_n > 0$ almost everywhere then $E[X] \geq 0$.



Theorem 2.10.3 (Birkhoff ergodic theorem, 1931). For any $X \in \mathcal{L}^1$ the time
average

$$ \frac{S_n}{n} = \frac{1}{n} \sum_{k=0}^{n-1} X(T^k) $$

converges almost everywhere to a $T$-invariant random variable $\overline{X}$ satisfying
$E[\overline{X}] = E[X]$. If $T$ is ergodic, then $\overline{X}$ is constant $E[X]$ almost everywhere
and $S_n/n$ converges to $E[X]$.
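
As a numerical illustration of the theorem, here is a small sketch (not from the original notes) for the irrational rotation of the circle from the examples above, written in the angle coordinate $x$ with $z = e^{2\pi i x}$, so that $T(x) = x + \alpha \bmod 1$. The observable $X(x) = \cos^2(2\pi x)$, the golden-mean rotation number and the function name `birkhoff_average` are choices made here for illustration; the space average of this observable is $1/2$.

```python
import math


def birkhoff_average(observable, alpha, x0, n):
    """Time average (1/n) * sum_{k=0}^{n-1} X(T^k x0) for T(x) = x + alpha mod 1."""
    total = 0.0
    x = x0
    for _ in range(n):
        total += observable(x)
        x = (x + alpha) % 1.0
    return total / n


def X(x):
    # Observable on the circle; its integral over [0, 1] is 1/2.
    return math.cos(2 * math.pi * x) ** 2


alpha = (math.sqrt(5) - 1) / 2  # an irrational rotation number
avg = birkhoff_average(X, alpha, x0=0.1, n=100_000)
print(avg)  # close to the space average 1/2
```

For a rational $\alpha$ the orbit is periodic and the time average depends in general on the starting point, in agreement with the fact that the rotation is then not ergodic.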



Proof. Define $\overline{X} = \limsup_{n \to \infty} S_n/n$, $\underline{X} = \liminf_{n \to \infty} S_n/n$. We get
$\overline{X} = \overline{X}(T)$ and $\underline{X} = \underline{X}(T)$ because

$$ \frac{n+1}{n} \frac{S_{n+1}}{(n+1)} - \frac{S_n(T)}{n} = \frac{X}{n} . $$

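(i) $\underline{X} = \overline{X}$.

Define for $\beta < \alpha \in \mathbb{R}$ the set $A_{\alpha,\beta} = \{ \underline{X} < \beta < \alpha < \overline{X} \}$. It is
$T$-invariant because $\underline{X}, \overline{X}$ are $T$-invariant as mentioned at the beginning of
the proof. Because $\{ \underline{X} < \overline{X} \} = \bigcup_{\beta < \alpha, \; \alpha, \beta \in \mathbb{Q}} A_{\alpha,\beta}$, it is enough to show
that $P[A_{\alpha,\beta}] = 0$ for rational $\beta < \alpha$. The rest of the proof establishes this.
In order to use the maximal ergodic theorem, we also define

$$ B_{\alpha,\beta} = \{ \sup_n (S_n - n\alpha) > 0 , \; \inf_n (S_n - n\beta) < 0 \} $$
$$ = \{ \sup_n (S_n/n - \alpha) > 0 , \; \inf_n (S_n/n - \beta) < 0 \} $$
$$ \supset \{ \limsup_n (S_n/n - \alpha) > 0 , \; \liminf_n (S_n/n - \beta) < 0 \} $$
$$ = \{ \overline{X} - \alpha > 0 , \; \underline{X} - \beta < 0 \} = A_{\alpha,\beta} . $$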

Because $A_{\alpha,\beta} \subset B_{\alpha,\beta}$ and $A_{\alpha,\beta}$ is $T$-invariant, we get from the maximal
ergodic theorem, applied to $X - \alpha$ and the system $T$ restricted to $A_{\alpha,\beta}$,
that $E[X - \alpha; A_{\alpha,\beta}] \geq 0$ and so

$$ E[X; A_{\alpha,\beta}] \geq \alpha \cdot P[A_{\alpha,\beta}] . \qquad (2.5) $$

Replacing $X, \alpha, \beta$ with $-X, -\beta, -\alpha$ and using $\overline{(-X)} = -\underline{X}$ shows in exactly
the same way that

$$ E[X; A_{\alpha,\beta}] \leq \beta \cdot P[A_{\alpha,\beta}] . \qquad (2.6) $$

The two inequalities (2.5), (2.6) imply that

$$ \beta P[A_{\alpha,\beta}] \geq \alpha P[A_{\alpha,\beta}] $$

which together with $\beta < \alpha$ only leaves us to conclude $P[A_{\alpha,\beta}] = 0$.

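(ii) $\overline{X} \in \mathcal{L}^1$.

Because $T$ is measure preserving, $E[|S_n/n|] \leq E[|X|]$ for all $n$, and by (i),
$S_n/n$ converges pointwise almost everywhere to $\overline{X} = \underline{X}$. Fatou's lemma gives
$E[|\overline{X}|] \leq \liminf_n E[|S_n/n|] \leq E[|X|] < \infty$, so that $\overline{X} \in \mathcal{L}^1$.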

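(iii) $E[\overline{X}] = E[X]$.

Define the $T$-invariant sets $B_{k,n} = \{ \overline{X} \in [\frac{k}{n}, \frac{k+1}{n}) \}$ for $k \in \mathbb{Z}, n \geq 1$. Define
for $\epsilon > 0$ the random variable $Y = X - \frac{k}{n} + \epsilon$ and call $\tilde{S}_m$ the sums where
$X$ is replaced by $Y$. We know that $\sup_m \tilde{S}_m > 0$ on $B_{k,n}$, because there
$\tilde{S}_m/m$ has the limit superior $\overline{X} - \frac{k}{n} + \epsilon \geq \epsilon$. When applying the maximal
ergodic theorem to the random variable $Y$ on $B_{k,n}$, we get $E[Y; B_{k,n}] \geq 0$.
Because $\epsilon > 0$ was arbitrary,

$$ E[X; B_{k,n}] \geq \frac{k}{n} P[B_{k,n}] . $$

With this inequality

$$ E[\overline{X}; B_{k,n}] \leq \frac{k+1}{n} P[B_{k,n}] \leq \frac{1}{n} P[B_{k,n}] + \frac{k}{n} P[B_{k,n}] \leq \frac{1}{n} P[B_{k,n}] + E[X; B_{k,n}] . $$

Summing over $k$ gives

$$ E[\overline{X}] \leq \frac{1}{n} + E[X] $$

and because $n$ was arbitrary, $E[\overline{X}] \leq E[X]$. Doing the same with $-X$ we
end with

$$ E[-\overline{X}] = E[\underline{(-X)}] \leq E[\overline{(-X)}] \leq E[-X] , $$

so that also $E[\overline{X}] \geq E[X]$.

□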



Corollary 2.10.4. The strong law of large numbers holds for IID random
variables $X_n \in \mathcal{L}^1$.



Proof. Given a sequence of IID random variables $X_n \in \mathcal{L}^1$. Let $\mu$ be the
law of $X_n$. Define the probability space $\Omega = (\mathbb{R}^{\mathbb{Z}}, \mathcal{A}, P)$, where $P = \mu^{\mathbb{Z}}$ is
the product measure. If $T : \Omega \to \Omega$, $T(\omega)_n = \omega_{n+1}$ denotes the shift on $\Omega$,
then $X_n = X(T^n)$ with $X(\omega) = \omega_0$. Since every $T$-invariant function
is constant almost everywhere, we must have $\overline{X} = E[X]$ almost everywhere,
so that $S_n/n \to E[X]$ almost everywhere. □

Remark. While ergodic theory is closely related to probability theory, the
notation in the two fields is often different. The reason is that the origins
of the theories are different. Ergodic theorists usually write $(X, \mathcal{A}, m)$ for
a probability space, not $(\Omega, \mathcal{A}, P)$. Of course an ergodic theorist looks at
probability theory as a special case of her field and a probabilist looks at
ergodic theory as a special case of his field. Another example of different
language is that ergodic theorists do not use the word "random
variables" $X$ but speak of "functions" $f$. This sounds different but is the same.
The two subjects can hardly be separated. Good introductions to ergodic
theory are [37, 13, 8, 79, 55, 112].




2.11 More convergence results 






We mention now some results about the almost everywhere convergence of 
sums of random variables in contrast to the weak and strong laws which 
were dealing with averaged sums. 



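Theorem 2.11.1 (Kolmogorov's inequalities). a) Assume $X_k \in \mathcal{L}^2$ are
independent random variables. Then

$$ P[ \sup_{1 \leq k \leq n} |S_k - E[S_k]| \geq \epsilon ] \leq \frac{1}{\epsilon^2} \mathrm{Var}[S_n] . $$

b) Assume $X_k \in \mathcal{L}^{\infty}$ are independent random variables and $||X_n||_{\infty} \leq R$.
Then

$$ P[ \sup_{1 \leq k \leq n} |S_k - E[S_k]| \geq \epsilon ] \geq 1 - \frac{(R + \epsilon)^2}{\sum_{k=1}^{n} \mathrm{Var}[X_k]} . $$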



Proof. We can assume $E[X_k] = 0$ without loss of generality.
a) For $1 \leq k \leq n$ we have

$$ S_n^2 - S_k^2 = (S_n - S_k)^2 + 2 (S_n - S_k) S_k \geq 2 (S_n - S_k) S_k $$

and therefore $E[S_n^2; A_k] \geq E[S_k^2; A_k]$ for all $A_k \in \sigma(X_1, \dots, X_k)$ by the
independence of $S_n - S_k$ and $S_k$. The sets $A_1 = \{ |S_1| \geq \epsilon \}$,
$A_{k+1} = \{ |S_{k+1}| \geq \epsilon, \max_{1 \leq l \leq k} |S_l| < \epsilon \}$ are mutually disjoint. We have to estimate
the probability of the events

$$ B_n = \{ \max_{1 \leq k \leq n} |S_k| \geq \epsilon \} = \bigcup_{k=1}^{n} A_k . $$

We get

$$ E[S_n^2] \geq E[S_n^2; B_n] = \sum_{k=1}^{n} E[S_n^2; A_k] \geq \sum_{k=1}^{n} E[S_k^2; A_k] \geq \epsilon^2 \sum_{k=1}^{n} P[A_k] = \epsilon^2 P[B_n] . $$

b)

$$ E[S_n^2; B_n] = E[S_n^2] - E[S_n^2; B_n^c] \geq E[S_n^2] - \epsilon^2 (1 - P[B_n]) . $$

On $A_k$, $|S_{k-1}| \leq \epsilon$ and $|S_k| \leq |S_{k-1}| + |X_k| \leq \epsilon + R$ holds. We use that in
the estimate

$$ E[S_n^2; B_n] = \sum_{k=1}^{n} E[S_k^2 + (S_n - S_k)^2; A_k] $$
$$ = \sum_{k=1}^{n} E[S_k^2; A_k] + \sum_{k=1}^{n} E[(S_n - S_k)^2; A_k] $$
$$ \leq (R + \epsilon)^2 \sum_{k=1}^{n} P[A_k] + \sum_{k=1}^{n} P[A_k] \sum_{j=k+1}^{n} \mathrm{Var}[X_j] $$
$$ \leq P[B_n] ( (\epsilon + R)^2 + E[S_n^2] ) $$

so that

$$ E[S_n^2] \leq P[B_n] ( (\epsilon + R)^2 + E[S_n^2] ) + \epsilon^2 - \epsilon^2 P[B_n] $$

and so

$$ P[B_n] \geq \frac{E[S_n^2] - \epsilon^2}{(\epsilon + R)^2 + E[S_n^2] - \epsilon^2} = 1 - \frac{(\epsilon + R)^2}{(\epsilon + R)^2 + E[S_n^2] - \epsilon^2} \geq 1 - \frac{(\epsilon + R)^2}{E[S_n^2]} . $$

□



Remark. The inequalities remain true in the limit $n \to \infty$. The first
inequality is then

$$ P[ \sup_k |S_k - E[S_k]| \geq \epsilon ] \leq \frac{1}{\epsilon^2} \sum_{k=1}^{\infty} \mathrm{Var}[X_k] . $$

Of course, the statement in a) is void, if the right hand side is infinite. In
this case, however, the inequality in b) states that $\sup_k |S_k - E[S_k]| \geq \epsilon$
almost surely for every $\epsilon > 0$.

Remark. For $n = 1$, Kolmogorov's inequality reduces to Chebychev's
inequality (2.5.5).
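
Inequality a) can be checked by a small Monte Carlo experiment. The sketch below is an illustration added here, not part of the original notes: it uses a symmetric random walk with steps $\pm 1$, for which $E[S_k] = 0$ and $\mathrm{Var}[S_n] = n$, and compares the empirical probability that the walk leaves $(-\epsilon, \epsilon)$ within $n$ steps with the bound $n/\epsilon^2$. The function names and the parameter values are assumptions made for the example.

```python
import random


def max_partial_sum_exceeds(n, eps, rng):
    """True if max_{1<=k<=n} |S_k| >= eps for one sampled +-1 random walk."""
    s = 0.0
    for _ in range(n):
        s += rng.choice((-1.0, 1.0))
        if abs(s) >= eps:
            return True
    return False


def estimate_exceedance(n, eps, trials, seed=0):
    """Monte Carlo estimate of P[max_k |S_k| >= eps]."""
    rng = random.Random(seed)
    hits = sum(max_partial_sum_exceeds(n, eps, rng) for _ in range(trials))
    return hits / trials


n, eps = 100, 25.0
empirical = estimate_exceedance(n, eps, trials=5000)
bound = n / eps**2  # Var[S_n] / eps^2 for steps of variance 1
print(empirical, bound)
```

The empirical value is typically well below the bound, which reflects that Kolmogorov's inequality, like Chebychev's, is far from sharp for a specific distribution.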



Lemma 2.11.2. A sequence $X_n$ of random variables converges almost
everywhere, if and only if

$$ \lim_{n \to \infty} P[ \sup_{k \geq 1} |X_{n+k} - X_n| \geq \epsilon ] = 0 $$

for all $\epsilon > 0$.



Proof. This is an exercise. 



□ 



Theorem 2.11.3 (Kolmogorov). Assume $X_n \in \mathcal{L}^2$ are independent and
$\sum_{n=1}^{\infty} \mathrm{Var}[X_n] < \infty$. Then

$$ \sum_{n=1}^{\infty} (X_n - E[X_n]) $$

converges almost everywhere.



Proof. Define $Y_n = X_n - E[X_n]$ and $S_n = \sum_{k=1}^{n} Y_k$. Given $m \in \mathbb{N}$. Apply
Kolmogorov's inequality to the sequence $Y_{m+k}$ to get

$$ P[ \sup_{n \geq m} |S_n - S_m| \geq \epsilon ] \leq \frac{1}{\epsilon^2} \sum_{k=m+1}^{\infty} E[Y_k^2] \to 0 $$

for $m \to \infty$. The above lemma implies that $S_n(\omega)$ converges. □



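Figure. We sum up independent random variables $X_k$ which take values
$\pm 1/k^{\alpha}$ with equal probability. According to theorem (2.11.3), the process

$$ S_n = \sum_{k=1}^{n} (X_k - E[X_k]) = \sum_{k=1}^{n} X_k $$

converges if

$$ \sum_{k=1}^{\infty} E[X_k^2] = \sum_{k=1}^{\infty} \frac{1}{k^{2\alpha}} $$

converges. This is the case if $\alpha > 1/2$. The picture shows some
experiments in the case $\alpha = 0.6$.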
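
The experiment described in the figure can be reproduced along the following lines. This is a sketch under the assumption that the steps are $\pm 1/k^{\alpha}$ with equal probability, as in the caption; the function name `random_series_partial_sums`, the seed and the number of terms are illustrative choices and not taken from the original.

```python
import random


def random_series_partial_sums(alpha, n, seed=0):
    """Partial sums S_1, ..., S_n of sum_k X_k with X_k = +-1/k^alpha."""
    rng = random.Random(seed)
    sums = []
    s = 0.0
    for k in range(1, n + 1):
        s += rng.choice((-1.0, 1.0)) / k**alpha
        sums.append(s)
    return sums


sums = random_series_partial_sums(alpha=0.6, n=100_000)
# Variance of S_100000 - S_50000: the tail of the series sum 1/k^(2*alpha).
tail_variance = sum(k ** (-1.2) for k in range(50_001, 100_001))
print(sums[999], sums[9_999], sums[99_999])
print(tail_variance)
```

The tail variance controls, through Kolmogorov's inequality, how much the partial sums can still move after index 50000. It tends to zero only slowly for $\alpha = 0.6$, so the convergence visible in such experiments is slow.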

The following theorem gives a necessary and sufficient condition that a
sum $S_n = \sum_{k=1}^{n} X_k$ converges for a sequence $X_n$ of independent random
variables.

Definition. Given $R \in \mathbb{R}$ and a random variable $X$, we define the bounded
random variable

$$ X^{(R)} = 1_{|X| \leq R} X . $$

Theorem 2.11.4 (Three series theorem). Assume $X_n \in \mathcal{L}$ are independent.
Then $\sum_{n=1}^{\infty} X_n$ converges almost everywhere if and only if for some $R > 0$
all of the following three series converge:

$$ \sum_{k=1}^{\infty} P[|X_k| > R] < \infty , \qquad (2.7) $$

$$ \sum_{k=1}^{\infty} |E[X_k^{(R)}]| < \infty , \qquad (2.8) $$

$$ \sum_{k=1}^{\infty} \mathrm{Var}[X_k^{(R)}] < \infty . \qquad (2.9) $$



Proof. "$\Leftarrow$" Assume first that the three series all converge. By (2.9) and
Kolmogorov's theorem, we know that $\sum_{k=1}^{\infty} (X_k^{(R)} - E[X_k^{(R)}])$ converges
almost surely. Therefore, by (2.8), $\sum_{k=1}^{\infty} X_k^{(R)}$ converges almost surely. By
(2.7) and Borel-Cantelli, $P[X_k \neq X_k^{(R)}, \text{infinitely often}] = 0$. Since for
almost all $\omega$, $X_k^{(R)}(\omega) = X_k(\omega)$ for sufficiently large $k$ and for almost all
$\omega$, $\sum_{k=1}^{\infty} X_k^{(R)}(\omega)$ converges, we get a set of measure one, where $\sum_{k=1}^{\infty} X_k$
converges.

"$\Rightarrow$" Assume now that $\sum_{n=1}^{\infty} X_n$ converges almost everywhere. Then $X_k \to 0$
almost everywhere and $P[|X_k| > R, \text{infinitely often}] = 0$ for every $R > 0$.
By the second Borel-Cantelli lemma, the sum (2.7) converges.
The almost sure convergence of $\sum_{n=1}^{\infty} X_n$ implies the almost sure
convergence of $\sum_{n=1}^{\infty} X_n^{(R)}$ since $P[|X_k| > R, \text{infinitely often}] = 0$.
Let $R > 0$ be fixed. Let $Y_k$ be a sequence of independent random
variables such that $Y_k$ and $X_k^{(R)}$ have the same distribution and that all the
random variables $X_k^{(R)}, Y_k$ are independent. The almost sure convergence
of $\sum_{n=1}^{\infty} X_n^{(R)}$ implies that of $\sum_{n=1}^{\infty} (X_n^{(R)} - Y_n)$. Since $E[X_k^{(R)} - Y_k] = 0$
and $P[|X_k^{(R)} - Y_k| \leq 2R] = 1$, by Kolmogorov inequality b), the series
$T_n = \sum_{k=1}^{n} (X_k^{(R)} - Y_k)$ satisfies for all $\epsilon > 0$

$$ P[ \sup_{k \geq 1} |T_{n+k} - T_n| \geq \epsilon ] \geq 1 - \frac{(2R + \epsilon)^2}{\sum_{k=n+1}^{\infty} \mathrm{Var}[X_k^{(R)} - Y_k]} . $$

Claim: $\sum_{k=1}^{\infty} \mathrm{Var}[X_k^{(R)} - Y_k] < \infty$.

Assume the sum is infinite. Then the above inequality gives
$P[\sup_{k \geq 1} |T_{n+k} - T_n| \geq \epsilon] = 1$. But this contradicts the almost sure convergence
of $\sum_{k=1}^{\infty} (X_k^{(R)} - Y_k)$ because the latter implies by lemma (2.11.2) that
$P[\sup_{k \geq 1} |T_{n+k} - T_n| \geq \epsilon] < 1/2$ for large enough $n$. Having shown that
$\sum_{k=1}^{\infty} \mathrm{Var}[X_k^{(R)} - Y_k] < \infty$, we get (2.9) because
$\mathrm{Var}[X_k^{(R)} - Y_k] = 2 \mathrm{Var}[X_k^{(R)}]$. We are done because then by Kolmogorov's
theorem (2.11.3), the sum $\sum_{k=1}^{\infty} (X_k^{(R)} - E[X_k^{(R)}])$ converges, so that (2.8) holds.

□

Figure. A special case of the three series theorem is when the $X_k$ are
uniformly bounded, $|X_k| \leq R$, and have zero expectation $E[X_k] = 0$. In that
case, almost everywhere convergence of $S_n = \sum_{k=1}^{n} X_k$ is equivalent to the
convergence of $\sum_{k=1}^{\infty} \mathrm{Var}[X_k]$. For example, in the case

$$ X_k = \pm \frac{1}{k^{\alpha}} $$

and $\alpha = 1/2$, we do not have almost everywhere convergence of $S_n$, because
$\sum_{k=1}^{\infty} \mathrm{Var}[X_k] = \sum_{k=1}^{\infty} \frac{1}{k} = \infty$.


Definition. A real number $\alpha \in \mathbb{R}$ is called a median of $X \in \mathcal{L}$ if
$P[X \leq \alpha] \geq 1/2$ and $P[X \geq \alpha] \geq 1/2$. We denote by $\mathrm{med}(X)$ the set of medians
of $X$.

Remark. The median is not unique and in general different from the mean.
It is also defined for random variables for which the mean does not exist.

The median differs from the mean maximally by a multiple of the standard
deviation:



Proposition 2.11.5. (Comparing median and mean) Let $Y \in \mathcal{L}^2$. Then every
$\alpha \in \mathrm{med}(Y)$ satisfies

$$ |\alpha - E[Y]| \leq \sqrt{2} \, \sigma[Y] . $$



Proof. For every $\beta \in \mathbb{R}$, one has

$$ \frac{|\alpha - \beta|^2}{2} \leq |\alpha - \beta|^2 \min( P[Y \geq \alpha], P[Y \leq \alpha] ) \leq E[(Y - \beta)^2] . $$

Now put $\beta = E[Y]$.

□
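
The proposition can be checked on a concrete empirical distribution. The sketch below is an added illustration with an arbitrarily chosen skewed data set; it uses the Python standard library functions `statistics.median`, `statistics.mean` and `statistics.pstdev`, where the sample median is one element of the set of medians of the empirical distribution.

```python
import math
import statistics

# A skewed sample: each value carries probability 1/5 under the
# empirical distribution, so median and mean differ noticeably.
data = [0.0, 0.0, 0.0, 1.0, 10.0]

alpha = statistics.median(data)   # a median of the empirical distribution
mean = statistics.mean(data)      # E[Y]
sigma = statistics.pstdev(data)   # population standard deviation sigma[Y]

gap = abs(alpha - mean)
bound = math.sqrt(2) * sigma
print(gap, bound)
```

Here the median is 0 while the mean is 2.2, and the gap stays below $\sqrt{2}\,\sigma[Y]$ as the proposition requires.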



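Theorem 2.11.6 (Levy). Given a sequence $X_n \in \mathcal{L}$ which is independent.
Choose $\alpha_{l,k} \in \mathrm{med}(S_l - S_k)$. Then, for all $n \in \mathbb{N}$ and all $\epsilon > 0$

$$ P[ \max_{1 \leq k \leq n} |S_k + \alpha_{n,k}| \geq \epsilon ] \leq 2 P[ |S_n| \geq \epsilon ] . $$

Proof. Fix $n \in \mathbb{N}$ and $\epsilon > 0$. The sets

$$ A_1 = \{ S_1 + \alpha_{n,1} \geq \epsilon \} , \quad A_{k+1} = \{ \max_{1 \leq l \leq k} (S_l + \alpha_{n,l}) < \epsilon , \; S_{k+1} + \alpha_{n,k+1} \geq \epsilon \} $$

for $1 \leq k < n$ are disjoint and $\bigcup_{k=1}^{n} A_k = \{ \max_{1 \leq k \leq n} (S_k + \alpha_{n,k}) \geq \epsilon \}$.
Because $\{ S_n \geq \epsilon \}$ contains all the sets $A_k \cap \{ S_n - S_k \geq \alpha_{n,k} \}$ for
$1 \leq k \leq n$, we obtain using the independence of $\sigma(A_k)$ and $\sigma(S_n - S_k)$

$$ P[S_n \geq \epsilon] \geq \sum_{k=1}^{n} P[ \{ S_n - S_k \geq \alpha_{n,k} \} \cap A_k ] $$
$$ = \sum_{k=1}^{n} P[ \{ S_n - S_k \geq \alpha_{n,k} \} ] P[A_k] $$
$$ \geq \frac{1}{2} \sum_{k=1}^{n} P[A_k] $$
$$ = \frac{1}{2} P[ \bigcup_{k=1}^{n} A_k ] $$
$$ = \frac{1}{2} P[ \max_{1 \leq k \leq n} (S_k + \alpha_{n,k}) \geq \epsilon ] . $$

Applying this inequality to $-X_n$, for which $-\alpha_{n,k}$ is a median of $-(S_n - S_k)$,
we get also $P[ \max_{1 \leq k \leq n} (-S_k - \alpha_{n,k}) \geq \epsilon ] \leq 2 P[-S_n \geq \epsilon]$ and so

$$ P[ \max_{1 \leq k \leq n} |S_k + \alpha_{n,k}| \geq \epsilon ] \leq 2 P[ |S_n| \geq \epsilon ] . $$

□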



Corollary 2.11.7. (Levy) Given a sequence $X_n \in \mathcal{L}$ of independent random
variables. If the partial sums $S_n$ converge in probability to $S$, then $S_n$
converges almost everywhere to $S$.



Proof. Take $\alpha_{l,k} \in \mathrm{med}(S_l - S_k)$. Since $S_n$ converges in probability, there
exists $m_1 \in \mathbb{N}$ such that $|\alpha_{l,k}| \leq \epsilon/2$ for all $m_1 \leq k \leq l$. In addition,
there exists $m_2 \in \mathbb{N}$ such that $\sup_{n \geq 1} P[ |S_{n+m} - S_m| \geq \epsilon/2 ] < \epsilon/2$ for all
$m \geq m_2$. For $m = \max\{ m_1, m_2 \}$, we have for $n \geq 1$

$$ P[ \max_{1 \leq l \leq n} |S_{l+m} - S_m| \geq \epsilon ] \leq P[ \max_{1 \leq l \leq n} |S_{l+m} - S_m + \alpha_{n+m,l+m}| \geq \epsilon/2 ] . $$

The right hand side can be estimated by theorem (2.11.6) applied to $X_{n+m}$
with

$$ \leq 2 P[ |S_{n+m} - S_m| \geq \frac{\epsilon}{2} ] < \epsilon . $$

Now apply the convergence lemma (2.11.2). □






Exercise. Prove the strong law of large numbers for independent but not
necessarily identically distributed random variables: Given a sequence of
independent random variables $X_n \in \mathcal{L}^2$ satisfying $E[X_n] = m$. If

$$ \sum_{k=1}^{\infty} \mathrm{Var}[X_k]/k^2 < \infty , $$

then $S_n/n \to m$ almost everywhere.

Hint: Use Kolmogorov's theorem for $Y_k = X_k/k$.



Exercise. Let $X_n$ be an IID sequence of random variables with uniform
distribution on $[0,1]$. Prove that almost surely

$$ \sum_{n=1}^{\infty} \prod_{i=1}^{n} X_i < \infty . $$

Hint: Use $\mathrm{Var}[\prod_i X_i] = \prod_i E[X_i^2] - \prod_i E[X_i]^2$ and use the three series
theorem.



2.12 Classes of random variables 

The probability distribution function $F_X : \mathbb{R} \to [0,1]$ of a random variable
$X$ was defined as

$$ F_X(x) = P[X \leq x] , $$

where $P[X \leq x]$ is a short hand notation for $P[\{ \omega \in \Omega \mid X(\omega) \leq x \}]$. With
the law $\mu_X = X^* P$ of $X$ on $\mathbb{R}$ one has $F_X(x) = \int_{-\infty}^{x} d\mu(x)$ so that $F$ is the
anti-derivative of $\mu$. One reason to introduce distribution functions is that
one can replace integrals on the probability space $\Omega$ by integrals on the real
line $\mathbb{R}$ which is more convenient.

Remark. The distribution function $F_X$ determines the law $\mu_X$ because the
measure $\nu((-\infty, a]) = F_X(a)$ on the $\pi$-system $\mathcal{I}$ given by the intervals
$\{ (-\infty, a] \}$ determines a unique measure on $\mathbb{R}$. Of course, the distribution
function does not determine the random variable itself. There are many
different random variables defined on different probability spaces, which
have the same distribution.

Proposition 2.12.1. The distribution function $F_X$ of a random variable is

a) non-decreasing,

b) $F_X(-\infty) = 0$, $F_X(\infty) = 1$,

c) continuous from the right: $F_X(x + h) \to F_X(x)$ for $h \to 0$, $h > 0$.

Furthermore, given a function $F$ with the properties a), b), c), there exists
a random variable $X$ on the probability space $(\Omega, \mathcal{A}, P)$ which satisfies
$F_X = F$.



Proof. a) follows from $\{ X \leq x \} \subset \{ X \leq y \}$ for $x \leq y$. b) $P[\{ X \leq -n \}] \to 0$
and $P[\{ X \leq n \}] \to 1$. c) $F_X(x + h) - F_X(x) = P[x < X \leq x + h] \to 0$ for
$h \to 0$, $h > 0$.

Given $F$, define $\Omega = \mathbb{R}$ and $\mathcal{A}$ as the Borel $\sigma$-algebra on $\mathbb{R}$. The measure
$P[(-\infty, a]] = F(a)$ on the $\pi$-system $\mathcal{I}$ defines a unique measure on $(\Omega, \mathcal{A})$.
The random variable $X(\omega) = \omega$ then has the distribution function $F$.



Remark. Every Borel probability measure $\mu$ on $\mathbb{R}$ determines a distribution
function $F_X$ of some random variable $X$ by

$$ \int_{-\infty}^{x} d\mu(x) = F(x) . $$

The proposition tells also that one can define a class of distribution
functions, the set of real functions $F$ which satisfy properties a), b), c).

Example. Bertrand's paradox mentioned in the introduction shows that the
choice of the distribution functions is important. In any of the three cases,
there is a density $f(x,y)$ on the unit disc which is radially symmetric. The
constant density $f(x,y) = 1/\pi$ is obtained when we throw the center of
the line into the disc. The disc $A_r$ of radius $r$ has probability $P[A_r] = r^2$.
The density in the $r$ direction is $2r$. The density $f(x,y) = 1/(2\pi r) =
1/(2\pi\sqrt{x^2 + y^2})$ is obtained when throwing parallel lines. This will put more
weight to the center. The probability $P[A_r] = r$ is bigger than the normalized
area $r^2$ of the disc. The radial density is $1$. A third density $f(x,y)$ is
obtained when we rotate the line around a point on the boundary. The disc $A_r$
of radius $r$ has probability $(2/\pi)\arcsin(r)$. The density in the $r$ direction
is $(2/\pi)/\sqrt{1 - r^2}$.



□

Figure. A plot of the radial density function $f(r)$ for the three different
interpretations of the Bertrand paradox.

Figure. A plot of the radial distribution function $F(r) = P[A_r]$. There are
different values at $F(1/2)$.



So, what happens, if we really do an experiment and throw randomly lines 
onto a disc? The punch line of the story is that the outcome of the ex- 
periment very much depends on how the experiment will be performed. If 
we would do the experiment by hand, we would probably try to throw the 
center of the stick into the middle of the disc. Since we would aim to the 
center, the distribution would be different from any of the three solutions 
given in Bertrand's paradox. 



Definition. A distribution function $F$ is called absolutely continuous (ac), if
there exists a Borel measurable function $f$ satisfying $F(x) = \int_{-\infty}^{x} f(y) \, dy$.
One calls a random variable with an absolutely continuous distribution
function a continuous random variable.

Definition. A distribution function is called pure point (pp) or atomic if
there exists a countable sequence of real numbers $x_n$ and a sequence of
positive numbers $p_n$, $\sum_n p_n = 1$ such that $F(x) = \sum_{n, x_n \leq x} p_n$. One calls
a random variable with a discrete distribution function a discrete random
variable.

Definition. A distribution function $F$ is called singular continuous (sc) if $F$
is continuous and if there exists a Borel set $S$ of zero Lebesgue measure such
that $\mu_F(S) = 1$. One calls a random variable with a singular continuous
distribution function a singular continuous random variable.

Remark. The definition of (ac), (pp) and (sc) distribution functions is
compatible with the definition of (ac), (pp) and (sc) Borel measures on $\mathbb{R}$. A Borel
measure is (pp), if $\mu(A) = \sum_{x \in A} \mu(\{x\})$. It is continuous, if it contains no
atoms, points with positive measure. It is (ac), if there exists a measurable
function $f$ such that $\mu = f \, dx$. It is (sc), if it is continuous and if $\mu(S) = 1$
for some Borel set $S$ of zero Lebesgue measure.

The following decomposition theorem shows that these three classes are 
natural: 



Theorem 2.12.2 (Lebesgue decomposition theorem). Every Borel measure
$\mu$ on $(\mathbb{R}, \mathcal{B})$ can be decomposed in a unique way as $\mu = \mu_{pp} + \mu_{ac} + \mu_{sc}$,
where $\mu_{pp}$ is pure point, $\mu_{sc}$ is singular continuous and $\mu_{ac}$ is absolutely
continuous with respect to the Lebesgue measure $\lambda$.



Proof. Denote by $\lambda$ the Lebesgue measure on $(\mathbb{R}, \mathcal{B})$ for which $\lambda([a,b]) = b - a$.
We first show that any measure $\mu$ can be decomposed as $\mu = \mu_{ac} + \mu_s$,
where $\mu_{ac}$ is absolutely continuous with respect to $\lambda$ and $\mu_s$ is singular.
The decomposition is unique: $\mu = \mu_{ac}^{(1)} + \mu_s^{(1)} = \mu_{ac}^{(2)} + \mu_s^{(2)}$ implies that
$\mu_{ac}^{(1)} - \mu_{ac}^{(2)} = \mu_s^{(2)} - \mu_s^{(1)}$ is both absolutely continuous and singular with
respect to $\lambda$ which is only possible, if they are zero. To get the existence
of the decomposition, define $c = \sup_{A \in \mathcal{B}, \lambda(A) = 0} \mu(A)$. If $c = 0$, then $\mu$ is
absolutely continuous and we are done. If $c > 0$, take an increasing sequence
$A_n \in \mathcal{B}$ with $\lambda(A_n) = 0$ and $\mu(A_n) \to c$. Define $A = \bigcup_{n \geq 1} A_n$ and $\mu_s$ as
$\mu_s(B) = \mu(A \cap B)$.
To split the singular part $\mu_s$ into a singular continuous and pure point part,
we again have uniqueness because $\mu_s = \mu_{sc}^{(1)} + \mu_{pp}^{(1)} = \mu_{sc}^{(2)} + \mu_{pp}^{(2)}$ implies that
$\nu = \mu_{sc}^{(1)} - \mu_{sc}^{(2)} = \mu_{pp}^{(2)} - \mu_{pp}^{(1)}$ is both singular continuous and pure point
which implies that $\nu = 0$. To get existence, define the finite or countable
set $A = \{ \omega \mid \mu(\{\omega\}) > 0 \}$ and define $\mu_{pp}(B) = \mu(A \cap B)$. □

Definition. The Gamma function is defined for $x > 0$ as

$$ \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt . $$

It satisfies $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$. Define also

$$ B(p,q) = \int_0^1 x^{p-1} (1-x)^{q-1} \, dx , $$

the Beta function.

Here are some examples of absolutely continuous distributions:

ac1) The normal distribution $N(m, \sigma^2)$ on $\Omega = \mathbb{R}$ has the probability
density function

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-m)^2}{2\sigma^2}} . $$

ac2) The Cauchy distribution on $\Omega = \mathbb{R}$ has the probability density function

$$ f(x) = \frac{1}{\pi} \frac{b}{b^2 + (x-m)^2} . $$

ac3) The uniform distribution on $\Omega = [a,b]$ has the probability density
function

$$ f(x) = \frac{1}{b-a} . $$

ac4) The exponential distribution with $\lambda > 0$ on $\Omega = [0,\infty)$ has the
probability density function

$$ f(x) = \lambda e^{-\lambda x} . $$

ac5) The log normal distribution on $\Omega = [0,\infty)$ has the density function

$$ f(x) = \frac{1}{\sqrt{2\pi x^2 \sigma^2}} e^{-\frac{(\log(x) - m)^2}{2\sigma^2}} . $$

ac6) The beta distribution on $\Omega = [0,1]$ with $p > 1, q > 1$ has the density

$$ f(x) = \frac{x^{p-1} (1-x)^{q-1}}{B(p,q)} . $$

ac7) The Gamma distribution on $\Omega = [0,\infty)$ with parameters $\alpha > 0, \beta > 0$
has the density

$$ f(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha} \Gamma(\alpha)} . $$


Figure. The probability density 
and the CDF of the normal dis- 
tribution. 



Figure. The probability density
and the CDF of the Cauchy distribution.

Figure. The probability density
and the CDF of the uniform distribution.




Figure. The probability density 
and the CDF of the exponential 
distribution. 



Definition. We use the notation

$$ \binom{n}{k} = \frac{n!}{(n-k)! \, k!} $$

for the Binomial coefficient, where $k! = k(k-1)(k-2) \cdots 2 \cdot 1$ is the factorial
of $k$ with the convention $0! = 1$. For example,

$$ \binom{10}{3} = \frac{10!}{7! \, 3!} = 10 \cdot 9 \cdot 8/6 = 120 . $$

Examples of discrete distributions:

pp1) The binomial distribution on $\Omega = \{ 0, 1, \dots, n \}$

$$ P[X = k] = \binom{n}{k} p^k (1-p)^{n-k} . $$

pp2) The Poisson distribution on $\Omega = \mathbb{N}$

$$ P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!} . $$

pp3) The discrete uniform distribution on $\Omega = \{ 1, \dots, n \}$

$$ P[X = k] = \frac{1}{n} . $$

pp4) The geometric distribution on $\Omega = \mathbb{N} = \{ 0, 1, 2, 3, \dots \}$

$$ P[X = k] = p (1-p)^k . $$

pp5) The distribution of first success on $\Omega = \mathbb{N} \setminus \{0\} = \{ 1, 2, 3, \dots \}$

$$ P[X = k] = p (1-p)^{k-1} . $$

Figure. The probabilities and the CDF of the binomial distribution.

Figure. The probabilities and the CDF of the uniform distribution.

Figure. The probabilities and the CDF of the Poisson distribution.

Figure. The probabilities and the CDF of the geometric distribution.
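
The formulas for the discrete distributions pp1) and pp2) above can be evaluated directly. The following sketch is an added illustration: the helper names `binomial_pmf`, `poisson_pmf` and `mean_and_variance` are hypothetical, and the Poisson distribution is truncated at a finite `kmax`, which is an approximation. It uses `math.comb` and `math.factorial` from the Python standard library.

```python
import math


def binomial_pmf(n, p):
    """Probabilities P[X = k], k = 0..n, of the binomial distribution."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def poisson_pmf(lam, kmax):
    """Probabilities P[X = k], k = 0..kmax, of the Poisson distribution
    (truncated at kmax, so the tail mass beyond kmax is dropped)."""
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)]


def mean_and_variance(pmf):
    """Mean and variance of a distribution on {0, 1, ..., len(pmf) - 1}."""
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return mean, second_moment - mean**2


print(math.comb(10, 3))
print(mean_and_variance(binomial_pmf(10, 0.3)))
print(mean_and_variance(poisson_pmf(2.0, 60)))
```

The computed means and variances can be compared with the closed formulas $np$, $np(1-p)$ for the binomial and $\lambda$, $\lambda$ for the Poisson distribution.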



An example of a singular continuous distribution:



sc1) The Cantor distribution. Let $C = \bigcap_{n=0}^{\infty} E_n$ be the Cantor set,
where $E_0 = [0,1]$, $E_1 = [0, 1/3] \cup [2/3, 1]$ and $E_n$ is inductively
obtained by cutting away the middle third of each interval in
$E_{n-1}$. Define

$$ F(x) = \lim_{n \to \infty} F_n(x) $$

where $F_n(x)$ has the density $(3/2)^n \cdot 1_{E_n}$. One can realize a random
variable with the Cantor distribution as a sum of IID random
variables as follows:

$$ X = \sum_{n=1}^{\infty} \frac{X_n}{3^n} , $$

where $X_n$ take values $0$ and $2$ with probability $1/2$ each.

Figure. The CDF of the Cantor distribution is continuous but not
absolutely continuous. The function $F_X(x)$ is in this case called
the Cantor function. Its graph is also called a Devil's staircase.
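
The series representation of the Cantor distribution gives a direct way to sample from it. The sketch below is an added illustration: it truncates the series after `depth` ternary digits (an approximation, with error at most $3^{-\mathrm{depth}}$), and the function name `cantor_sample` and the sample size are choices made here. The empirical mean and variance can be compared with the values $1/2$ and $1/8$ computed later in this section.

```python
import random


def cantor_sample(rng, depth=40):
    """Approximate sample X = sum_{n=1}^{depth} X_n / 3^n with X_n in {0, 2}."""
    return sum(rng.choice((0, 2)) / 3**n for n in range(1, depth + 1))


rng = random.Random(0)
samples = [cantor_sample(rng) for _ in range(20_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, variance)
```

No sample falls into the removed middle third $(1/3, 2/3)$, which reflects that the distribution is supported on the Cantor set.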



Lemma 2.12.3. Given $X \in \mathcal{L}$ with law $\mu$. For any measurable map
$h : \mathbb{R} \to [0,\infty)$ for which $h(X) \in \mathcal{L}^1$, one has $E[h(X)] = \int_{\mathbb{R}} h(x) \, d\mu(x)$.
Especially, if $\mu = \mu_{ac} = f \, dx$ then

$$ E[h(X)] = \int_{\mathbb{R}} h(x) f(x) \, dx . $$

If $\mu = \mu_{pp}$, then

$$ E[h(X)] = \sum_{x, \mu(\{x\}) \neq 0} h(x) \mu(\{x\}) . $$



Proof. If the function $h$ is nonnegative, prove it first for $X = c 1_{x \in A}$, then
for step functions $X \in \mathcal{S}$ and then by the monotone convergence theorem
for any $X \in \mathcal{L}$ for which $h(X) \in \mathcal{L}^1$. If $h(X)$ is integrable, then
$E[h(X)] = E[h^+(X)] - E[h^-(X)]$. □



Proposition 2.12.4.

Distribution        Parameters                          Mean                  Variance
ac1) Normal         $m \in \mathbb{R}, \sigma^2 > 0$    $m$                   $\sigma^2$
ac2) Cauchy         $m \in \mathbb{R}, b > 0$           "$m$"                 $\infty$
ac3) Uniform        $a < b$                             $(a+b)/2$             $(b-a)^2/12$
ac4) Exponential    $\lambda > 0$                       $1/\lambda$           $1/\lambda^2$
ac5) Log-Normal     $m \in \mathbb{R}, \sigma^2 > 0$    $e^{m+\sigma^2/2}$    $(e^{\sigma^2} - 1) e^{2m+\sigma^2}$
ac6) Beta           $p, q > 0$                          $p/(p+q)$             $pq/((p+q)^2 (p+q+1))$
ac7) Gamma          $\alpha, \beta > 0$                 $\alpha\beta$         $\alpha\beta^2$

Proposition 2.12.5.

Distribution        Parameters                          Mean                  Variance
pp1) Binomial       $n \in \mathbb{N}, p \in [0,1]$     $np$                  $np(1-p)$
pp2) Poisson        $\lambda > 0$                       $\lambda$             $\lambda$
pp3) Uniform        $n \in \mathbb{N}$                  $(1+n)/2$             $(n^2-1)/12$
pp4) Geometric      $p \in (0,1)$                       $(1-p)/p$             $(1-p)/p^2$
pp5) First Success  $p \in (0,1)$                       $1/p$                 $(1-p)/p^2$
sc1) Cantor         -                                   $1/2$                 $1/8$



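Proof. These are direct computations, which we do in some of the examples:

Exponential distribution:

\[ E[X^p] = \int_0^{\infty} x^p \lambda e^{-\lambda x} \, dx = \frac{p}{\lambda} E[X^{p-1}] = \frac{p!}{\lambda^p} . \]

Poisson distribution:

\[ E[X] = \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda . \]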

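For calculating higher moments, one can also use the probability generating function

\[ E[z^X] = \sum_{k=0}^{\infty} e^{-\lambda} \frac{(\lambda z)^k}{k!} = e^{-\lambda(1-z)} \]

and then differentiate this identity with respect to $z$ at the place $z = 1$. We get then

\[ E[X] = \lambda, \quad E[X(X-1)] = \lambda^2, \quad E[X(X-1)(X-2)] = \lambda^3, \dots \]

so that $E[X^2] = \lambda + \lambda^2$ and $Var[X] = \lambda$.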

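Geometric distribution. Differentiating the identity for the geometric series

\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1-x} \]

gives

\[ \sum_{k=0}^{\infty} k x^{k-1} = \frac{1}{(1-x)^2} . \]

Therefore

\[ E[X] = \sum_{k=0}^{\infty} k (1-p)^k p = p(1-p) \sum_{k=1}^{\infty} k (1-p)^{k-1} = \frac{p(1-p)}{p^2} = \frac{1-p}{p} . \]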



For calculating the higher moments one can proceed as in the Poisson case 
or use the moment generating function. 

Cantor distribution: because one can realize a random variable with the Cantor distribution as $X = \sum_{n=1}^{\infty} X_n/3^n$, where the IID random variables $X_n$ take the values 0 and 2 with probability $p = 1/2$ each, we have

\[ E[X] = \sum_{n=1}^{\infty} \frac{E[X_n]}{3^n} = \sum_{n=1}^{\infty} \frac{1}{3^n} = \frac{1}{1-1/3} - 1 = \frac{1}{2} \]

and

\[ Var[X] = \sum_{n=1}^{\infty} \frac{Var[X_n]}{9^n} = \sum_{n=1}^{\infty} \frac{1}{9^n} = \frac{1}{1-1/9} - 1 = \frac{1}{8} . \]

See also corollary (3.1.6) for another computation. □
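The values 1/2 and 1/8 can be checked with exact arithmetic on a truncated sum. The following is a minimal sketch, not part of the original notes: the function name `truncated_cantor_moments` and the truncation depth are illustrative assumptions. It enumerates all digit sequences of the truncated series $\sum_{n=1}^{N} X_n/3^n$ and compares with the partial geometric sums $(1-3^{-N})/2$ and $(1-9^{-N})/8$.

```python
from fractions import Fraction
from itertools import product

def truncated_cantor_moments(depth):
    """Exact mean and variance of sum_{n=1}^{depth} X_n / 3^n,
    where the X_n are independent and take the values 0 and 2 with probability 1/2."""
    values = []
    for digits in product((0, 2), repeat=depth):
        # digits[n] plays the role of X_{n+1}
        values.append(sum(Fraction(d, 3 ** (n + 1)) for n, d in enumerate(digits)))
    count = len(values)  # all 2^depth digit sequences are equally likely
    mean = sum(values) / count
    variance = sum((v - mean) ** 2 for v in values) / count
    return mean, variance

depth = 10
mean, variance = truncated_cantor_moments(depth)
print(float(mean), float(variance))
```

As the depth grows, the truncated mean and variance approach 1/2 and 1/8, in agreement with the table entry sc1).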

Computations can sometimes be done in an elegant way using characteristic functions $\phi_X(t) = E[e^{itX}]$ or moment generating functions $M_X(t) = E[e^{tX}]$. With the moment generating function one can get the moments with the moment formula

\[ E[X^n] = \int_{\mathbb{R}} x^n \, d\mu = \frac{d^n M_X}{dt^n}(t) \Big|_{t=0} . \]

For the characteristic function one obtains

\[ E[X^n] = \int_{\mathbb{R}} x^n \, d\mu = (-i)^n \frac{d^n \phi_X}{dt^n}(t) \Big|_{t=0} . \]



Example. The random variable $X(x) = x$ has the uniform distribution on $[0,1]$. Its moment generating function is $M_X(t) = \int_0^1 e^{tx} \, dx = (e^t - 1)/t = 1 + t/2! + t^2/3! + \dots$. A comparison of coefficients gives the moments $E[X^m] = 1/(m+1)$, which agrees with the moment formula.

Example. A random variable $X$ which has the normal distribution $N(m,\sigma^2)$ has the moment generating function $M_X(t) = e^{tm + \sigma^2 t^2/2}$. All the moments can be obtained with the moment formula. For example, $E[X] = M_X'(0) = m$, $E[X^2] = M_X''(0) = m^2 + \sigma^2$.

Example. For a Poisson distributed random variable $X$ on $\Omega = \mathbb{N} = \{0,1,2,3,\dots\}$ with $P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}$, the moment generating function is

\[ M_X(t) = \sum_{k=0}^{\infty} P[X = k] e^{tk} = e^{\lambda(e^t - 1)} . \]
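The moment formula can be tried out numerically on the Poisson moment generating function. The sketch below is only an illustration under stated assumptions: the helper names `poisson_mgf` and `moments_from_mgf`, the step size `h` and the value of `lam` are not from the text, and the derivatives at $t = 0$ are approximated by central finite differences rather than computed symbolically.

```python
import math

def poisson_mgf(t, lam):
    """Moment generating function exp(lam * (e^t - 1)) of the Poisson distribution."""
    return math.exp(lam * (math.exp(t) - 1.0))

def moments_from_mgf(mgf, h=1e-3):
    """Approximate M'(0) and M''(0) by central finite differences with step h."""
    first = (mgf(h) - mgf(-h)) / (2.0 * h)
    second = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h ** 2
    return first, second

lam = 2.5
m1, m2 = moments_from_mgf(lambda t: poisson_mgf(t, lam))
print(m1, m2, m2 - m1 ** 2)  # approximations of E[X], E[X^2] and Var[X]
```

The first two outputs should be close to $\lambda$ and $\lambda + \lambda^2$, so that the difference reproduces the variance $\lambda$ up to discretization error.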



Example. A random variable $X$ on $\Omega = \mathbb{N} = \{0,1,2,3,\dots\}$ with the geometric distribution $P[X = k] = p(1-p)^k$ has the moment generating function

\[ M_X(t) = \sum_{k=0}^{\infty} e^{kt} p (1-p)^k = p \sum_{k=0}^{\infty} ((1-p)e^t)^k = \frac{p}{1 - (1-p)e^t} . \]

A random variable $X$ on $\Omega = \{1,2,3,\dots\}$ with the distribution of first success $P[X = k] = p(1-p)^{k-1}$ has the moment generating function

\[ M_X(t) = \sum_{k=1}^{\infty} e^{kt} p (1-p)^{k-1} = e^t p \sum_{k=0}^{\infty} ((1-p)e^t)^k = \frac{p e^t}{1 - (1-p)e^t} . \]



Exercise. Compute the mean and variance of the Erlang distribution

\[ f(x) = \frac{\lambda^k x^{k-1}}{(k-1)!} e^{-\lambda x} \]

on the positive real line $\Omega = [0,\infty)$ with the help of the moment generating function. If $k$ is allowed to be an arbitrary positive real number, then the Erlang distribution is called the Gamma distribution.



Definition. The kurtosis of a random variable $X$ is defined as $Kurt[X] = E[(X - E[X])^4]/\sigma[X]^4$. The excess kurtosis is defined as $Kurt[X] - 3$. Excess kurtosis is often abbreviated by kurtosis. A distribution with positive excess kurtosis appears more peaked, a distribution with negative excess kurtosis appears more flat.



Exercise. Verify that if $X, Y$ are independent random variables of the same distribution then the kurtosis of the sum is the average of the kurtosis $Kurt[X + Y] = (Kurt[X] + Kurt[Y])/2$.

Exercise. Prove that for any $a \neq 0$ and any $b$ the random variable $Y = aX + b$ has the same kurtosis $Kurt[Y] = Kurt[X]$.

Exercise. Show that the standard normal distribution has zero excess kurtosis. Now use the previous exercise to see that every normally distributed random variable has zero excess kurtosis.



Lemma 2.12.6. If $X, Y$ are independent random variables, then their moment generating functions satisfy

\[ M_{X+Y}(t) = M_X(t) \cdot M_Y(t) . \]

Proof. If $X$ and $Y$ are independent, then also $e^{tX}$ and $e^{tY}$ are independent. Therefore,

\[ E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) \cdot M_Y(t) . \]

□
Example. The lemma can be used to compute the moment generating function of the binomial distribution. A random variable $X$ with binomial distribution can be written as a sum of IID random variables $X_i$ taking values 0 and 1 with probability $1-p$ and $p$. Because for $n = 1$, we have $M_{X_i}(t) = (1-p) + pe^t$, the moment generating function of $X$ is $M_X(t) = [(1-p) + pe^t]^n$. The moment formula allows us to compute moments $E[X^n]$ and central moments $E[(X - E[X])^n]$ of $X$. Examples:

\[ E[X] = np \]
\[ E[X^2] = np(1 - p + np) \]
\[ Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2 = np(1-p) \]
\[ E[X^3] = np(1 + 3(n-1)p + (2 - 3n + n^2)p^2) \]
\[ E[X^4] = np(1 + 7(n-1)p + 6(2 - 3n + n^2)p^2 + (-6 + 11n - 6n^2 + n^3)p^3) \]
\[ E[(X - E[X])^4] = E[X^4] - 4E[X]E[X^3] + 6E[X]^2 E[X^2] - 3E[X]^4 \]
\[ = np(1-p)(1 + (3n-6)p - (3n-6)p^2) \]
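Closed formulas for binomial moments are easy to get wrong, so a check against the definition is useful. The sketch below sums directly over the binomial probabilities with exact rational arithmetic; the function names and the test values $n = 7$, $p = 2/5$ are illustrative assumptions. It compares the raw moments up to order four with the polynomial expressions for $E[X]$, $E[X^2]$, $E[X^3]$, $E[X^4]$, and the fourth central moment with the standard closed form $np(1-p)(1 + 3(n-2)p(1-p))$.

```python
from fractions import Fraction
from math import comb

def binomial_raw_moment(n, p, order):
    """E[X^order] for X binomial B(n, p), by direct summation over k = 0..n."""
    return sum(Fraction(k) ** order * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

def binomial_central_moment(n, p, order):
    """E[(X - E[X])^order] for X binomial B(n, p), by direct summation."""
    mean = n * p
    return sum((k - mean) ** order * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

n, p = 7, Fraction(2, 5)
m1, m2, m3, m4 = (binomial_raw_moment(n, p, r) for r in (1, 2, 3, 4))
c4 = binomial_central_moment(n, p, 4)
print(m1, m2, m3, m4, c4)
```

Because `Fraction` is exact, equality with the closed formulas can be tested without any tolerance.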



Example. The sum $X + Y$ of a Poisson distributed random variable $X$ with parameter $\lambda$ and an independent Poisson distributed random variable $Y$ with parameter $\mu$ is Poisson distributed with parameter $\lambda + \mu$, as can be seen by multiplying their moment generating functions.

Definition. An interesting quantity for a random variable with a continuous distribution with probability density $f_X$ is the Shannon entropy or simply entropy

\[ H(X) = - \int_{\mathbb{R}} f_X(x) \log(f_X(x)) \, dx . \]

Without restricting the class of functions, $H(X)$ is allowed to be $-\infty$ or $\infty$. The entropy allows one to distinguish several distributions from others by asking for the distribution with the largest entropy. For example, among all distribution functions on the positive real line $[0,\infty)$ with fixed expectation $m = 1/\lambda$, the exponential distribution $\lambda e^{-\lambda x}$ is the one with maximal entropy. We will return to these interesting entropy extremization questions later.

Example. Let us compute the entropy of the random variable $X(x) = x^m$ on $([0,1], \mathcal{B}, dx)$. We have seen earlier that the density of $X$ is $f_X(x) = x^{1/m-1}/m$ so that

\[ H(X) = - \int_0^1 (x^{1/m-1}/m) \log(x^{1/m-1}/m) \, dx . \]

To compute this integral, note first that $f(x) = x^a \log(x^a) = a x^a \log(x)$ has the antiderivative $a x^{1+a}((1+a)\log(x) - 1)/(1+a)^2$ so that $\int_0^1 x^a \log(x^a) \, dx = -a/(1+a)^2$ and $H(X) = (1 - m + \log(m))$. Because $\frac{d}{dm} H(X_m) = (1/m) - 1$ and $\frac{d^2}{dm^2} H(X_m) = -1/m^2$, the entropy has its maximum at $m = 1$, where the density is uniform. The entropy decreases for $m \to \infty$. Among all random variables $X(x) = x^m$, the random variable $X(x) = x$ has maximal entropy.



Figure. The entropy of the random variables $X(x) = x^m$ on $[0,1]$ as a function of $m$. The maximum is attained for $m = 1$, which is the uniform distribution.




2.13 Weak convergence 

Definition. Denote by $C_b(\mathbb{R})$ the vector space of bounded continuous functions on $\mathbb{R}$. This means that $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)| < \infty$ for every $f \in C_b(\mathbb{R})$. A sequence of Borel probability measures $\mu_n$ on $\mathbb{R}$ converges weakly to a probability measure $\mu$ on $\mathbb{R}$ if for every $f \in C_b(\mathbb{R})$ one has

\[ \int_{\mathbb{R}} f \, d\mu_n \to \int_{\mathbb{R}} f \, d\mu \]

in the limit $n \to \infty$.
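A concrete instance may help to fix the definition. In the sketch below, which is an illustration and not taken from the notes, $\mu_n$ is assumed to be the uniform probability measure on the grid $\{1/n, 2/n, \dots, n/n\}$ and $\mu$ the Lebesgue measure on $[0,1]$; the test function $f = \cos$ is an arbitrary bounded continuous choice. The integrals $\int f \, d\mu_n$ are finite averages and should approach $\int_0^1 \cos(x) \, dx = \sin(1)$.

```python
import math

def integrate_against_mu_n(f, n):
    """Integral of f with respect to the uniform measure on {1/n, 2/n, ..., n/n}."""
    return sum(f(k / n) for k in range(1, n + 1)) / n

# Limit value: integral of cos over [0, 1] with respect to Lebesgue measure.
limit = math.sin(1.0)
errors = [abs(integrate_against_mu_n(math.cos, n) - limit) for n in (10, 100, 1000)]
print(errors)
```

Checking a single test function does not prove weak convergence, of course; it only illustrates what the definition asks for.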



Remark. For weak convergence, it is enough to test $\int_{\mathbb{R}} f \, d\mu_n \to \int_{\mathbb{R}} f \, d\mu$ for a dense set in $C_b(\mathbb{R})$. This dense set can consist of the space $P(\mathbb{R})$ of polynomials or the space $C_b^\infty(\mathbb{R})$ of bounded, smooth functions.

An important fact is that a sequence of random variables $X_n$ converges in distribution to $X$ if and only if $E[h(X_n)] \to E[h(X)]$ for all smooth functions $h$ on the real line. This will be used in the proof of the central limit theorem.

Weak convergence defines a topology on the set $M_1(\mathbb{R})$ of all Borel probability measures on $\mathbb{R}$. Similarly, one has a topology for $M_1([a,b])$.

Lemma 2.13.1. The set $M_1(I)$ of all probability measures on an interval $I = [a,b]$ is a compact topological space.

Proof. We need to show that any sequence $\mu_n$ of probability measures on $I$ has an accumulation point. The set of functions $f_k(x) = x^k$ on $[a,b]$ span all polynomials and so a dense set in $C_b([a,b])$. The sequence $\mu_n$ converges to $\mu$ if and only if all the moments $\int x^k \, d\mu_n$ converge for $n \to \infty$ and for all $k \in \mathbb{N}$. In other words, the compactness of $M_1([a,b])$ is equivalent to the compactness of the product space $I^{\mathbb{N}}$ with the product topology, which is Tychonov's theorem. □



Remark. In functional analysis, a more general theorem called the Banach-Alaoglu theorem is known: a closed and bounded set in the dual space $X^*$ of a Banach space $X$ is compact with respect to the weak-* topology, where the functionals $\mu_n$ converge to $\mu$ if and only if $\mu_n(f)$ converges to $\mu(f)$ for all $f \in X$. In the present case, $X = C_b[a,b]$ and the dual space $X^*$ is the space of all signed measures on $[a,b]$ (see [7]).

Remark. The compactness of probability measures can also be seen by looking at the distribution functions $F_\mu(s) = \mu((-\infty, s])$. Given a sequence $F_n$ of monotonically increasing functions, there is a subsequence $F_{n_k}$ which converges to another monotonically increasing function $F$, which is again a distribution function. This fact generalizes to distribution functions on the line, where the limiting function $F$ is still a right-continuous and non-decreasing function (Helly's selection theorem), but the function $F$ does not need to be a distribution function any more if the interval $[a,b]$ is replaced by the real line $\mathbb{R}$.

Definition. A sequence of random variables $X_n$ converges weakly or in law to a random variable $X$, if the laws $\mu_{X_n}$ of $X_n$ converge weakly to the law $\mu_X$ of $X$.

Definition. Given a distribution function $F$, we denote by $Cont(F)$ the set of continuity points of $F$.

Remark. Because $F$ is nondecreasing and takes values in $[0,1]$, the only possible discontinuity is a jump discontinuity. They happen at points $t_i$, where $a_i = \mu(\{t_i\}) > 0$. There can be only countably many such discontinuities, because for every rational number $p/q > 0$, there are only finitely many $a_i$ with $a_i > p/q$ because $\sum_i a_i \leq 1$.

Definition. We say that a sequence of random variables $X_n$ converges in distribution to a random variable $X$, if $F_{X_n}(x) \to F_X(x)$ pointwise for all $x \in Cont(F_X)$.

Theorem 2.13.2 (Weak convergence = convergence in distribution). A sequence $X_n$ of random variables converges in law to a random variable $X$ if and only if $X_n$ converges in distribution to $X$.



Proof. (i) Assume we have convergence in law. We want to show that we have convergence in distribution. Given $s \in Cont(F)$ and $\delta > 0$. Define a continuous function $f$ with $1_{(-\infty,s]} \leq f \leq 1_{(-\infty,s+\delta]}$. Then

\[ F_n(s) = \int_{\mathbb{R}} 1_{(-\infty,s]} \, d\mu_n \leq \int_{\mathbb{R}} f \, d\mu_n \leq \int_{\mathbb{R}} 1_{(-\infty,s+\delta]} \, d\mu_n = F_n(s+\delta) . \]

This gives

\[ \limsup_{n \to \infty} F_n(s) \leq \lim_{n \to \infty} \int f \, d\mu_n = \int f \, d\mu \leq F(s+\delta) . \]

Similarly, we obtain with a function $1_{(-\infty,s-\delta]} \leq f \leq 1_{(-\infty,s]}$

\[ \liminf_{n \to \infty} F_n(s) \geq \lim_{n \to \infty} \int f \, d\mu_n = \int f \, d\mu \geq F(s-\delta) . \]

Since $F$ is continuous at $s$ we have for $\delta \to 0$:

\[ F(s) = \lim_{\delta \to 0} F(s-\delta) \leq \liminf_{n \to \infty} F_n(s) \leq \limsup_{n \to \infty} F_n(s) \leq F(s) . \]

That is, we have established convergence in distribution.

(ii) Assume now that we have convergence in distribution but no convergence in law. There exists then a continuous function $f$ so that $\int f \, d\mu_n \to \int f \, d\mu$ fails. That is, there is a subsequence and $\epsilon > 0$ such that $|\int f \, d\mu_{n_k} - \int f \, d\mu| \geq \epsilon > 0$. There exists a compact interval $I$ such that $|\int_I f \, d\mu_{n_k} - \int_I f \, d\mu| \geq \epsilon/2 > 0$ and we can assume that $\mu_{n_k}$ and $\mu$ have support on $I$. The set of all probability measures on $I$ is compact in the weak topology. Therefore, a subsequence of $\mu_{n_k}$ converges weakly to a measure $\nu$ and $|\nu(f) - \mu(f)| \geq \epsilon/2$. Define the $\pi$-system $\mathcal{I}$ of all intervals $\{(-\infty,s] \mid s$ continuity point of $F\}$. We have $\mu_n((-\infty,s]) = F_{X_n}(s) \to F_X(s) = \mu((-\infty,s])$. Using (i) we see $\mu_{n_k}((-\infty,s]) \to \nu((-\infty,s])$ also, so that $\mu$ and $\nu$ agree on the $\pi$-system $\mathcal{I}$. If $\mu$ and $\nu$ agree on $\mathcal{I}$, they agree on the $\pi$-system of all intervals $\{(-\infty,s]\}$. By lemma (2.1.4), we know that $\mu = \nu$ on the Borel $\sigma$-algebra and so $\mu = \nu$. This contradicts $|\nu(f) - \mu(f)| \geq \epsilon/2$. So, the initial assumption of having no convergence in law was wrong. □

2.14 The central limit theorem 

Definition. For any random variable $X$ with non-zero variance, we denote by

\[ X^* = \frac{X - E[X]}{\sigma(X)} \]

the normalized random variable, which has mean $E[X^*] = 0$ and standard deviation $\sigma(X^*) = \sqrt{Var[X^*]} = 1$. Given a sequence of random variables $X_k$, we again use the notation $S_n = \sum_{k=1}^n X_k$.



Theorem 2.14.1 (Central limit theorem for independent $L^3$ random variables). Assume $X_i \in \mathcal{L}^3$ are independent and satisfy

\[ M = \sup_i \|X_i\|_3 < \infty, \quad \delta = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^n Var[X_i] > 0 . \]

Then $S_n^*$ converges in distribution to a random variable with standard normal distribution $N(0,1)$:

\[ \lim_{n \to \infty} P[S_n^* \leq x] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \, dy, \quad \forall x \in \mathbb{R} . \]




Figure (three panels). The probability density function $f_{S_n^*}$ of the random variable $X(x) = x$ on $[-1,1]$.



Lemma 2.14.2. A $N(0,\sigma^2)$ distributed random variable $X$ satisfies

\[ E[|X|^p] = \frac{1}{\sqrt{\pi}} 2^{p/2} \sigma^p \, \Gamma(\tfrac{1}{2}(p+1)) . \]

Especially $E[|X|^3] = \sqrt{8/\pi} \, \sigma^3$.

Proof. With the density function $f(x) = (2\pi\sigma^2)^{-1/2} e^{-x^2/(2\sigma^2)}$, we have $E[|X|^p] = 2 \int_0^\infty x^p f(x) \, dx$ which is after a substitution $z = x^2/(2\sigma^2)$ equal to

\[ \frac{1}{\sqrt{\pi}} 2^{p/2} \sigma^p \int_0^\infty z^{\frac{1}{2}(p+1)-1} e^{-z} \, dz . \]

The integral to the right is by definition equal to $\Gamma(\frac{1}{2}(p+1))$. □

After this preliminary computation, we turn to the proof of the central 
limit theorem. 

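Proof. Define for fixed $n \geq 1$ the random variables

\[ Y_i = \frac{X_i - E[X_i]}{\sigma(S_n)}, \quad 1 \leq i \leq n \]

so that $S_n^* = \sum_{i=1}^n Y_i$. Define $N(0,\sigma_i^2)$-distributed random variables $\tilde{Y}_i$ with $\sigma_i^2 = Var[Y_i]$, having the property that the set of random variables

\[ \{ Y_1, \dots, Y_n, \tilde{Y}_1, \dots, \tilde{Y}_n \} \]

are independent. The distribution of $\tilde{S}_n = \sum_{i=1}^n \tilde{Y}_i$ is just the normal distribution $N(0,1)$. In order to show the theorem, we have to prove $E[f(S_n^*)] - E[f(\tilde{S}_n)] \to 0$ for any $f \in C_b(\mathbb{R})$. It is enough to verify it for smooth $f$ of compact support. Define

\[ Z_k = \tilde{Y}_1 + \dots + \tilde{Y}_{k-1} + Y_{k+1} + \dots + Y_n . \]

Note that $Z_1 + Y_1 = S_n^*$ and $Z_n + \tilde{Y}_n = \tilde{S}_n$. Using first a telescopic sum and then Taylor's theorem, we can write

\[ f(S_n^*) - f(\tilde{S}_n) = \sum_{k=1}^n [f(Z_k + Y_k) - f(Z_k + \tilde{Y}_k)] \]
\[ = \sum_{k=1}^n [f'(Z_k)(Y_k - \tilde{Y}_k)] + \sum_{k=1}^n [\tfrac{1}{2} f''(Z_k)(Y_k^2 - \tilde{Y}_k^2)] + \sum_{k=1}^n [R(Z_k,Y_k) - R(Z_k,\tilde{Y}_k)] \]

with a Taylor rest term $R(Z,Y)$, which can depend on $f$. Taking expectations, the first two sums vanish because $Z_k$ is independent of $Y_k$ and $\tilde{Y}_k$, which have the same mean and the same variance. We get therefore

\[ |E[f(S_n^*)] - E[f(\tilde{S}_n)]| \leq \sum_{k=1}^n E[|R(Z_k,Y_k)|] + E[|R(Z_k,\tilde{Y}_k)|] . \quad (2.10) \]

Because $\tilde{Y}_k$ are $N(0,\sigma_k^2)$-distributed, we get by lemma (2.14.2) and the Jensen inequality (2.5.1)

\[ E[|\tilde{Y}_k|^3] = \sqrt{8/\pi} \, \sigma_k^3 = \sqrt{8/\pi} \, E[|Y_k|^2]^{3/2} \leq \sqrt{8/\pi} \, E[|Y_k|^3] . \]

Taylor's theorem gives $|R(Z_k,Y_k)| \leq const \cdot |Y_k|^3$ so that for large $n$

\[ \sum_{k=1}^n E[|R(Z_k,Y_k)|] + E[|R(Z_k,\tilde{Y}_k)|] \leq const \cdot \sum_{k=1}^n E[|Y_k|^3] \]
\[ \leq const \cdot n \cdot \sup_i \|X_i\|_3^3 / Var[S_n]^{3/2} \]
\[ = const \cdot \frac{\sup_i \|X_i\|_3^3}{(Var[S_n]/n)^{3/2}} \cdot \frac{1}{\sqrt{n}} \]
\[ \leq \frac{M^3}{\delta^{3/2}} \cdot \frac{const}{\sqrt{n}} = \frac{C(f)}{\sqrt{n}} \to 0 . \]

We have seen that for every smooth $f \in C_b(\mathbb{R})$ there exists a constant $C(f)$ such that $|E[f(S_n^*)] - E[f(\tilde{S}_n)]| \leq C(f)/\sqrt{n}$. □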

If we assume the $X_i$ to be identically distributed, we can relax the condition $X_i \in \mathcal{L}^3$ to $X_i \in \mathcal{L}^2$:

Theorem 2.14.3 (Central limit theorem for IID $L^2$ random variables). If $X_i \in \mathcal{L}^2$ are IID and satisfy $0 < Var[X_i]$, then $S_n^*$ converges weakly to a random variable with standard normal distribution $N(0,1)$.



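Proof. The previous proof can be modified. We change the estimation of the Taylor rest term to $|R(z,y)| \leq \delta(y) \cdot y^2$ with $\delta(y) \to 0$ for $|y| \to 0$. Assume without loss of generality $E[X_1] = 0$ and write $\sigma = \sigma(X_1)$, so that $Y_k = X_k/(\sigma\sqrt{n})$. Using the IID property we can estimate the rest term

\[ R = \sum_{k=1}^n E[|R(Z_k,Y_k)|] + E[|R(Z_k,\tilde{Y}_k)|] \]

as follows

\[ R \leq \sum_{k=1}^n E[\delta(Y_k) Y_k^2] + E[\delta(\tilde{Y}_k) \tilde{Y}_k^2] \]
\[ = n \cdot E[\delta(\tfrac{X_1}{\sigma\sqrt{n}}) \tfrac{X_1^2}{\sigma^2 n}] + n \cdot E[\delta(\tfrac{\tilde{X}_1}{\sigma\sqrt{n}}) \tfrac{\tilde{X}_1^2}{\sigma^2 n}] \]
\[ = E[\delta(\tfrac{X_1}{\sigma\sqrt{n}}) \tfrac{X_1^2}{\sigma^2}] + E[\delta(\tfrac{\tilde{X}_1}{\sigma\sqrt{n}}) \tfrac{\tilde{X}_1^2}{\sigma^2}] , \]

where $\tilde{X}_1$ denotes a $N(0,\sigma^2)$-distributed random variable. Both terms converge to zero for $n \to \infty$ because of the dominated convergence theorem (2.4.3): for the first term for example, $\delta(\frac{X_1}{\sigma\sqrt{n}}) \frac{X_1^2}{\sigma^2} \to 0$ pointwise almost everywhere, because $\delta(y) \to 0$ and $X_1 \in \mathcal{L}^2$. Note also that the function $\delta$, which depends on the test function $f$ in the proof of the previous result, is bounded so that the roof function in the dominated convergence theorem exists. It is $C X_1^2$ for some constant $C$. By (2.4.3) the expectation goes to zero as $n \to \infty$. □

The central limit theorem can be interpreted as a solution to a fixed point problem:

Definition. Let $\mathcal{P}_{0,1}$ be the space of probability measures $\mu$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ which have the properties that $\int_{\mathbb{R}} x^2 \, d\mu(x) = 1$, $\int_{\mathbb{R}} x \, d\mu(x) = 0$. Define the map

\[ T\mu(A) = \int_{\mathbb{R}} \int_{\mathbb{R}} 1_A(\frac{x+y}{\sqrt{2}}) \, d\mu(x) \, d\mu(y) \]

on $\mathcal{P}_{0,1}$.

Corollary 2.14.4. The only attracting fixed point of $T$ on $\mathcal{P}_{0,1}$ is the law of the standard normal distribution.

Proof. Let $\mu$ be the law of independent random variables $X, Y$ with $Var[X] = Var[Y] = 1$ and $E[X] = E[Y] = 0$. Then $T(\mu)$ is the law of the normalized random variable $(X+Y)/\sqrt{2}$, because the independent random variables $X, Y$ can be realized on the probability space $(\mathbb{R}^2, \mathcal{B}, \mu \times \mu)$ as coordinate functions $X((x,y)) = x$, $Y((x,y)) = y$. Now use that $T^n(X) = (S_{2^n})^*$ converges in distribution to $N(0,1)$. □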

For independent 0-1 experiments with win probability $p \in (0,1)$, the central limit theorem is quite old. In this case

\[ \lim_{n \to \infty} P[\frac{S_n - np}{\sqrt{np(1-p)}} \leq x] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \, dy \]

as had been shown by de Moivre in 1730 in the case $p = 1/2$ and for general $p \in (0,1)$ by Laplace in 1812. It is a direct consequence of the central limit theorem:

Corollary 2.14.5. (De Moivre-Laplace limit theorem) The distribution of $X_n^*$ converges to the normal distribution if $X_n$ has the binomial distribution $B(n,p)$.
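The De Moivre-Laplace statement can be explored numerically by comparing the distribution function of the normalized binomial variable with the standard normal distribution function. The following sketch makes a few assumptions that are not in the text: the helper names, the evaluation points, and the use of `math.lgamma` to evaluate binomial probabilities in logarithmic form so that large $n$ does not overflow.

```python
import math

def standard_normal_cdf(x):
    """Distribution function of N(0, 1), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_binomial_pmf(n, p, k):
    """Logarithm of P[S_n = k] for S_n binomial B(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def normalized_binomial_cdf(n, p, x):
    """P[(S_n - n p) / sqrt(n p (1 - p)) <= x]."""
    threshold = n * p + x * math.sqrt(n * p * (1.0 - p))
    return sum(math.exp(log_binomial_pmf(n, p, k))
               for k in range(n + 1) if k <= threshold)

def worst_gap(n, p, points=(-2.0, -1.0, 0.0, 0.5, 1.0, 2.0)):
    """Largest distance to the normal distribution function over a few sample points."""
    return max(abs(normalized_binomial_cdf(n, p, x) - standard_normal_cdf(x))
               for x in points)

print(worst_gap(100, 0.3), worst_gap(2500, 0.3))
```

The gap shrinks roughly like $1/\sqrt{n}$, which matches the rate $C(f)/\sqrt{n}$ appearing in the proof of the central limit theorem.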



For more general versions of the central limit theorem, see [109]. 

The next limit theorem for discrete random variables illustrates why the Poisson distribution on $\mathbb{N}$ is natural. Denote by $B(n,p)$ the binomial distribution on $\{0,1,\dots,n\}$ and with $P_\alpha$ the Poisson distribution on $\mathbb{N} = \{0,1,2,\dots\}$.

Theorem 2.14.6 (Poisson limit theorem). Let $X_n$ be $B(n,p_n)$-distributed and suppose $n p_n \to \alpha$. Then $X_n$ converges in distribution to a random variable $X$ with Poisson distribution with parameter $\alpha$.

Proof. We have to show that $P[X_n = k] \to P[X = k]$ for each fixed $k \in \mathbb{N}$.

\[ P[X_n = k] = \binom{n}{k} p_n^k (1-p_n)^{n-k} \]
\[ = \frac{n(n-1)(n-2) \cdots (n-k+1)}{k!} p_n^k (1-p_n)^{n-k} \]
\[ \sim \frac{1}{k!} (n p_n)^k (1 - \frac{n p_n}{n})^{n-k} \to \frac{\alpha^k}{k!} e^{-\alpha} . \]

□
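The convergence of the binomial probabilities to the Poisson probabilities is easy to observe numerically. The sketch below is an illustration with assumed names (`binomial_pmf`, `poisson_pmf`, `max_pmf_gap`) and an assumed cutoff `kmax` on the values of $k$ that are compared; it takes $p_n = \alpha/n$ exactly.

```python
import math

def binomial_pmf(n, p, k):
    """P[X_n = k] for X_n binomial B(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(alpha, k):
    """P[X = k] for X Poisson distributed with parameter alpha."""
    return math.exp(-alpha) * alpha ** k / math.factorial(k)

def max_pmf_gap(n, alpha, kmax=20):
    """Largest difference |P[X_n = k] - P[X = k]| over k = 0..min(n, kmax), with p_n = alpha/n."""
    p = alpha / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(alpha, k))
               for k in range(min(n, kmax) + 1))

gaps = [max_pmf_gap(n, 1.0) for n in (10, 100, 1000)]
print(gaps)
```

For $\alpha = 1$ the largest gap drops by about a factor of ten each time $n$ is multiplied by ten.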



Figure. The binomial distribution $B(2,1/2)$ has its support on $\{0,1,2\}$.

Figure. The binomial distribution $B(5,1/5)$ has its support on $\{0,1,2,3,4,5\}$.

Figure. The Poisson distribution with $\alpha = 1$ on $\mathbb{N} = \{0,1,2,3,\dots\}$.



Exercise. It is customary to use the notation

\[ \Phi(s) = F_X(s) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^s e^{-y^2/2} \, dy \]

for the distribution function of a random variable $X$ which has the standard normal distribution $N(0,1)$. Given a sequence of IID random variables $X_n$ with this distribution.

a) Justify that one can estimate for large $n$ probabilities

\[ P[a \leq S_n^* \leq b] \sim \Phi(b) - \Phi(a) . \]

b) Assume $X_i$ are all uniformly distributed random variables in $[0,1]$. Estimate for large $n$

\[ P[|S_n/n - 0.5| \geq \epsilon] \]

in terms of $\epsilon$ and $n$.

c) Compare the result in b) with the estimate obtained in the weak law of large numbers.



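Exercise. Define for $\lambda > 0$ the transformation

\[ T_\lambda(\mu)(A) = \int_{\mathbb{R}} \int_{\mathbb{R}} 1_A(\frac{x+y}{\lambda}) \, d\mu(x) \, d\mu(y) \]

in $\mathcal{P} = M_1(\mathbb{R})$, the set of all Borel probability measures on $\mathbb{R}$. For which $\lambda$ can you describe the limit?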



2.15 Entropy of distributions 

Denote by $\nu$ a (not necessarily finite) measure on a measure space $(\Omega, \mathcal{A})$. An example is the Lebesgue measure on $\mathbb{R}$ or the counting measure on $\mathbb{N}$. Note that the measure is defined only on a $\delta$-subring of $\mathcal{A}$ since we did not assume that $\nu$ is finite.



Definition. A probability measure $\mu$ on $\mathbb{R}$ is called $\nu$ absolutely continuous, if there exists a density $f \in \mathcal{L}^1(\nu)$ such that $\mu = f\nu$. If $\mu$ is $\nu$-absolutely continuous, one writes $\mu \ll \nu$. Call $\mathcal{P}(\nu)$ the set of all $\nu$ absolutely continuous probability measures. In other words, the set $\mathcal{P}(\nu)$ is the set of functions $f \in \mathcal{L}^1(\nu)$ satisfying $f \geq 0$ and $\int f(x) \, d\nu(x) = 1$.

Remark. The fact that $\mu \ll \nu$ defined earlier is equivalent to this is called the Radon-Nikodym theorem (3.1.1). The function $f$ is therefore called the Radon-Nikodym derivative of $\mu$ with respect to $\nu$.

Example. If $\nu$ is the counting measure on $\mathbb{N} = \{0,1,2,\dots\}$ and $\mu$ is the law of the geometric distribution with parameter $p$, then the density is $f(k) = p(1-p)^k$.

Example. If $\nu$ is the Lebesgue measure on $(-\infty,\infty)$ and $\mu$ is the law of the standard normal distribution, then the density is $f(x) = e^{-x^2/2}/\sqrt{2\pi}$. There is a multi-variable calculus trick using polar coordinates, which immediately shows that $f$ is a density:

\[ \int \int_{\mathbb{R}^2} e^{-(x^2+y^2)/2} \, dx \, dy = \int_0^\infty \int_0^{2\pi} e^{-r^2/2} r \, d\theta \, dr = 2\pi . \]

Definition. For any probability measure $\mu \in \mathcal{P}(\nu)$ define the entropy

\[ H(\mu) = \int_\Omega -f(\omega) \log(f(\omega)) \, d\nu(\omega) . \]

It generalizes the earlier defined Shannon entropy, where the assumption had been $d\nu = dx$.

Example. Let $\nu$ be the counting measure on a countable set $\Omega$, where $\mathcal{A}$ is the $\sigma$-algebra of all subsets of $\Omega$ and let the measure $\nu$ be defined on the $\delta$-ring of all finite subsets of $\Omega$. In this case,

\[ H(\mu) = \sum_{\omega \in \Omega} -f(\omega) \log(f(\omega)) . \]

For example, for $\Omega = \mathbb{N} = \{0,1,2,3,\dots\}$ with counting measure $\nu$, the geometric distribution $P[\{k\}] = p(1-p)^k$ has the entropy

\[ \sum_{k=0}^\infty -(1-p)^k p \log((1-p)^k p) = \log(\frac{1-p}{p}) - \frac{\log(1-p)}{p} . \]

Example. Let $\nu$ be the Lebesgue measure on $\mathbb{R}$. If $\mu = f \, dx$ has a density function $f$, we have

\[ H(\mu) = \int_{\mathbb{R}} -f(x) \log(f(x)) \, dx . \]

For example, for the standard normal distribution $\mu$ with probability density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the entropy is $H(f) = (1 + \log(2\pi))/2$.

Example. Let $\nu$ be the Lebesgue measure $dx$ on $\Omega = \mathbb{R}^+ = [0,\infty)$. A random variable on $\Omega$ with probability density function $f(x) = \lambda e^{-\lambda x}$ is said to have the exponential distribution. It has the mean $1/\lambda$. The entropy of this distribution is $1 - \log(\lambda)$.
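The three entropy values in these examples can be cross-checked numerically. The sketch below is an illustration under assumptions that are not in the text: the geometric sum is truncated after a fixed number of terms, the integrals are approximated by a midpoint rule on a bounded interval, and the function names and parameter values ($p = 0.3$, $\lambda = 2$) are arbitrary choices.

```python
import math

def geometric_entropy(p, terms=2000):
    """Truncated sum of -f(k) log f(k) for the geometric density f(k) = p (1-p)^k."""
    total = 0.0
    for k in range(terms):
        mass = p * (1.0 - p) ** k
        if mass > 0.0:  # skip terms that underflow to zero
            total -= mass * math.log(mass)
    return total

def density_entropy(f, a, b, steps=100_000):
    """Midpoint rule approximation of the integral of -f log f over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        fx = f(a + (i + 0.5) * h)
        if fx > 0.0:
            total -= fx * math.log(fx) * h
    return total

p, lam = 0.3, 2.0
h_geometric = geometric_entropy(p)
h_normal = density_entropy(lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi), -12.0, 12.0)
h_exponential = density_entropy(lambda x: lam * math.exp(-lam * x), 0.0, 40.0)
print(h_geometric, h_normal, h_exponential)
```

The outputs can be compared with $\log((1-p)/p) - \log(1-p)/p$, with $(1 + \log(2\pi))/2$ and with the value $1 - \log(\lambda)$ obtained from $-\int_0^\infty \lambda e^{-\lambda x} (\log(\lambda) - \lambda x) \, dx$.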



Example. Let $\nu$ be a probability measure on $\mathbb{R}$, $f$ a density and

\[ \mathcal{A} = \{A_1, \dots, A_n\} \]

a partition of $\mathbb{R}$. For the step function

\[ \tilde{f} = \sum_{i=1}^n (\int_{A_i} f \, d\nu) 1_{A_i} \in S(\nu) , \]

the entropy $H(\tilde{f}\nu)$ is equal to

\[ H(\{A_i\}) = \sum_i -\nu(A_i) \log(\nu(A_i)) \]

which is called the entropy of the partition $\{A_i\}$. The approximation of the density $f$ by a step function $\tilde{f}$ is called coarse graining and the entropy of $\tilde{f}$ is called the coarse grained entropy. It was first considered by Gibbs in 1902.



Remark. In ergodic theory, where one studies measure preserving transformations $T$ of probability spaces, one is interested in the growth rate of the entropy of a partition generated by $\mathcal{A}, T(\mathcal{A}), \dots, T^n(\mathcal{A})$. This leads to the notion of an entropy of a measure preserving transformation called Kolmogorov-Sinai entropy.

Interpretation. Assume that $\Omega$ is finite, that $\nu$ is the counting measure and that $\mu(\{\omega\}) = f(\omega)$ is the probability distribution of a random variable describing the measurement of an experiment. If the event $\{\omega\}$ happens, then $-\log(f(\omega))$ is a measure for the information or "surprise" that the event happens. The averaged information or surprise is

\[ H(\mu) = \sum_\omega -f(\omega) \log(f(\omega)) . \]

If $f$ takes only the values 0 or 1, which means that $\mu$ is deterministic, then $H(\mu) = 0$. There is no surprise then and the measurements show a unique value. On the other hand, if $f$ is the uniform distribution on $\Omega$, then $H(\mu) = \log(|\Omega|)$ is larger than 0 if $\Omega$ has more than one element. We will see in a moment that the uniform distribution has the maximal entropy.



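Definition. Given two probability measures $\mu = f\nu$ and $\tilde{\mu} = \tilde{f}\nu$ which are both absolutely continuous with respect to $\nu$. Define the relative entropy

\[ H(\tilde{\mu}|\mu) = \int_\Omega \tilde{f}(\omega) \log(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu(\omega) \in [0,\infty] . \]

It is the expectation $E_{\tilde{\mu}}[l]$ of the likelihood coefficient $l = \log(\frac{\tilde{f}(x)}{f(x)})$. The negative relative entropy $-H(\tilde{\mu}|\mu)$ is also called the conditional entropy. We write also $H(\tilde{f}|f)$ instead of $H(\tilde{\mu}|\mu)$.

Theorem 2.15.1 (Gibbs inequality). $0 \leq H(\tilde{\mu}|\mu) \leq +\infty$ and $H(\tilde{\mu}|\mu) = 0$ if and only if $\mu = \tilde{\mu}$.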



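Proof. We can assume $H(\tilde{\mu}|\mu) < \infty$. The function $u(x) = x \log(x)$ is convex on $\mathbb{R}^+ = [0,\infty)$ and satisfies $u(x) \geq x - 1$.

\[ H(\tilde{\mu}|\mu) = \int_\Omega \tilde{f}(\omega) \log(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu(\omega) \]
\[ = \int_\Omega f(\omega) \frac{\tilde{f}(\omega)}{f(\omega)} \log(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu(\omega) \]
\[ = \int_\Omega f(\omega) u(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu(\omega) \]
\[ \geq \int_\Omega f(\omega) (\frac{\tilde{f}(\omega)}{f(\omega)} - 1) \, d\nu(\omega) \]
\[ = \int_\Omega \tilde{f}(\omega) - f(\omega) \, d\nu(\omega) = 1 - 1 = 0 . \]

If $\mu = \tilde{\mu}$, then $f = \tilde{f}$ almost everywhere, so that $\frac{\tilde{f}(\omega)}{f(\omega)} = 1$ almost everywhere and $H(\tilde{\mu}|\mu) = 0$. On the other hand, if $H(\tilde{\mu}|\mu) = 0$, then by the Jensen inequality (2.5.1)

\[ 0 = E_\mu[u(\frac{\tilde{f}}{f})] \geq u(E_\mu[\frac{\tilde{f}}{f}]) = u(1) = 0 . \]

Therefore, $E_\mu[u(\frac{\tilde{f}}{f})] = u(E_\mu[\frac{\tilde{f}}{f}])$. The strict convexity of $u$ implies that $\frac{\tilde{f}}{f}$ must be a constant almost everywhere. Since both $f$ and $\tilde{f}$ are densities, the equality $f = \tilde{f}$ must be true almost everywhere. □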
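On a finite set with counting measure, the Gibbs inequality can be tested directly. The sketch below is a small illustration; the function name `relative_entropy` and the two example densities are assumptions chosen for the demonstration. It evaluates $H(\tilde{\mu}|\mu) = \sum_\omega \tilde{f}(\omega) \log(\tilde{f}(\omega)/f(\omega))$ in both orders.

```python
import math

def relative_entropy(f_tilde, f):
    """H(mu~ | mu) = sum of f~ log(f~ / f) for densities on a finite set (counting measure)."""
    total = 0.0
    for a, b in zip(f_tilde, f):
        if a > 0.0:
            if b == 0.0:
                return math.inf  # mu~ is not absolutely continuous with respect to mu
            total += a * math.log(a / b)
    return total

f = [0.5, 0.25, 0.25]
f_tilde = [0.2, 0.3, 0.5]
forward = relative_entropy(f_tilde, f)
backward = relative_entropy(f, f_tilde)
print(forward, backward, relative_entropy(f, f))
```

Both values are positive and they differ from each other, which shows that the relative entropy is not symmetric in its two arguments; for identical densities the value is zero.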



Remark. The relative entropy can be used to measure the distance between two distributions. It is not a metric, however. The relative entropy is also known under the name Kullback-Leibler divergence or Kullback-Leibler metric, if $\nu = dx$ [88].

Theorem 2.15.2 (Distributions with maximal entropy). The following distributions have maximal entropy.

a) $\Omega$ finite with counting measure $\nu$. The uniform distribution on $\Omega$ has maximal entropy among all distributions on $\Omega$. It is unique with this property.

b) $\Omega = \mathbb{N}$ with counting measure $\nu$. The geometric distribution with parameter $p = 1/(1+c)$ has maximal entropy among all distributions on $\mathbb{N} = \{0,1,2,3,\dots\}$ with fixed mean $c$. It is unique with this property.

c) $\Omega = \{0,1\}^N$ with counting measure $\nu$. The product distribution $\eta^N$, where $\eta(1) = p$, $\eta(0) = 1-p$ with $p = c/N$, has maximal entropy among all distributions satisfying $E[S_N] = c$, where $S_N(\omega) = \sum_{i=1}^N \omega_i$. It is unique with this property.

d) $\Omega = [0,\infty)$ with Lebesgue measure $\nu$. The exponential distribution with density $f(x) = \lambda e^{-\lambda x}$ with parameter $\lambda$ on $\Omega$ has the maximal entropy among all distributions with fixed mean $c = 1/\lambda$. It is unique with this property.

e) $\Omega = \mathbb{R}$ with Lebesgue measure $\nu$. The normal distribution $N(m,\sigma^2)$ has maximal entropy among all distributions with fixed mean $m$ and fixed variance $\sigma^2$. It is unique with this property.

f) Finite measures. Let $(\Omega, \mathcal{A})$ be an arbitrary measure space for which $0 < \nu(\Omega) < \infty$. Then the measure $\nu$ with uniform distribution $f = 1/\nu(\Omega)$ has maximal entropy among all other measures on $\Omega$. It is unique with this property.



Proof. Let $\mu = f\nu$ be the measure of the distribution for which we want to prove maximal entropy and let $\tilde{\mu} = \tilde{f}\nu$ be any other measure. The aim is to show $H(\tilde{\mu}|\mu) = H(\mu) - H(\tilde{\mu})$, which implies the maximality since by the Gibbs inequality (2.15.1) $H(\tilde{\mu}|\mu) \geq 0$.
In general,

\[ H(\tilde{\mu}|\mu) = -H(\tilde{\mu}) - \int_\Omega \tilde{f}(\omega) \log(f(\omega)) \, d\nu \]

so that in each case, we have to show

\[ H(\mu) = - \int_\Omega \tilde{f}(\omega) \log(f(\omega)) \, d\nu . \quad (2.11) \]

With

\[ H(\tilde{\mu}|\mu) = H(\mu) - H(\tilde{\mu}) \]

we also have uniqueness: if two measures $\tilde{\mu}, \mu$ have maximal entropy, then $H(\tilde{\mu}|\mu) = 0$ so that by the Gibbs inequality (2.15.1) $\mu = \tilde{\mu}$.

a) The density $f = 1/|\Omega|$ is constant. Therefore $H(\mu) = \log(|\Omega|)$ and equation (2.11) holds.

b) The geometric distribution on $\mathbb{N} = \{0,1,2,\dots\}$ satisfies $P[\{k\}] = f(k) = p(1-p)^k$. We have computed the entropy before as

\[ \log(\frac{1-p}{p}) - \frac{\log(1-p)}{p} = -\log(p) - \frac{1-p}{p} \log(1-p) = -\log(p) - c \log(1-p) . \]

Since $\log(f(k)) = \log(p) + k \log(1-p)$, the right hand side of (2.11) is $-\log(p) - E_{\tilde{\mu}}[X] \log(1-p)$, and the claim follows since we fixed the mean $E_{\tilde{\mu}}[X] = c$.

c) The discrete density is $f(\omega) = p^{S_N}(1-p)^{N-S_N}$ so that

\[ \log(f(\omega)) = S_N \log(p) + (N - S_N) \log(1-p) \]

and

\[ \sum_\omega \tilde{f}(\omega) \log(f(\omega)) = E_{\tilde{\mu}}[S_N] \log(p) + (N - E_{\tilde{\mu}}[S_N]) \log(1-p) . \]

The claim follows since we fixed $E[S_N]$.

d) The density is $f(x) = \alpha e^{-\alpha x}$, so that $\log(f(x)) = \log(\alpha) - \alpha x$. The claim follows since $E[X] = \int x \, d\tilde{\mu}(x)$ was assumed to be fixed for all distributions.

e) For the normal distribution $\log(f(x)) = a + b(x-m)^2$ with two real numbers $a, b$ depending only on $m$ and $\sigma$. The claim follows since we fixed $Var[X] = E[(x-m)^2]$ for all distributions.

f) The density $f = 1/\nu(\Omega)$ is constant. Therefore $H(\mu) = \log(\nu(\Omega))$, which is also on the right hand side of equation (2.11). □

Remark. This result has relations to the foundations of thermodynamics, where one considers the phase space of $N$ particles moving in a finite region in Euclidean space. The energy surface is then a compact surface $\Omega$ and the motion on this surface leaves a measure $\nu$ invariant which is induced from the flow invariant Lebesgue measure. The measure $\nu$ is called the microcanonical ensemble. According to f) in the above, it is the measure which maximizes entropy.

Remark. Let us try to get the maximal distribution using calculus of variations. In order to find the maximum of the functional

\[ H(f) = - \int f \log(f) \, d\nu \]

on $\mathcal{L}^1(\nu)$ under the constraints

\[ F(f) = \int f \, d\nu = 1, \quad G(f) = \int X f \, d\nu = c , \]

we have to find the critical points of $\tilde{H} = H - \lambda F - \mu G$. In infinite dimensions, constrained critical points are points where the Lagrange equations

\[ \frac{\delta}{\delta f} H(f) = \lambda \frac{\delta}{\delta f} F(f) + \mu \frac{\delta}{\delta f} G(f) \]
\[ F(f) = 1 \]
\[ G(f) = c \]

are satisfied. The derivative $\delta/\delta f$ is the functional derivative and $\lambda, \mu$ are the Lagrange multipliers. We find $(f, \lambda, \mu)$ as a solution of the system of equations

\[ -1 - \log(f(x)) = \lambda + \mu x , \]
\[ \int f(x) \, d\nu(x) = 1 , \]
\[ \int x f(x) \, d\nu(x) = c \]

by solving the first equation for $f$:

\[ f = e^{-\lambda - \mu x - 1} \]
\[ \int e^{-\lambda - \mu x - 1} \, d\nu(x) = 1 \]
\[ \int x e^{-\lambda - \mu x - 1} \, d\nu(x) = c \]

and dividing the third equation by the second, so that we can get $\mu$ from the equation $\int x e^{-\mu x} \, d\nu(x) = c \int e^{-\mu x} \, d\nu(x)$ and $\lambda$ from the second equation $e^{1+\lambda} = \int e^{-\mu x} \, d\nu(x)$. This variational approach produces critical points of the entropy. Because the Hessian $D^2(H) = -1/f$ is negative definite, it is also negative definite when restricted to the surface in $\mathcal{L}^1$ determined by the restrictions $F = 1, G = c$. This indicates that we have found a global maximum.

Example. For $\Omega = \mathbb{R}$, $X(x) = x^2$, we get the normal distribution $N(0,1)$.

Example. For $\Omega = \mathbb{N}$, $X(n) = \epsilon_n$, we get $f(n) = e^{-\epsilon_n \lambda_1}/Z(f)$ with $Z(f) = \sum_n e^{-\epsilon_n \lambda_1}$ and where $\lambda_1$ is determined by $\sum_n \epsilon_n e^{-\epsilon_n \lambda_1} = c \, Z(f)$. This is called the discrete Maxwell-Boltzmann distribution. In physics, one writes $\lambda_1^{-1} = kT$ with the Boltzmann constant $k$, determining $T$, the temperature.
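The multiplier $\lambda_1$ is in general only defined implicitly by the mean constraint, so it has to be found numerically. The sketch below makes several assumptions that go beyond the text: the set of energy levels is finite (four hypothetical values), the prescribed mean is $c = 1$, and the equation is solved by bisection, using that the mean energy $\sum_n \epsilon_n f(n)$ is a decreasing function of the multiplier.

```python
import math

def boltzmann_weights(energies, lam):
    """Densities f(n) = exp(-lam * e_n) / Z on a finite list of energy levels."""
    weights = [math.exp(-lam * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, lam):
    """Expectation of the energy under the Boltzmann weights."""
    return sum(e * f for e, f in zip(energies, boltzmann_weights(energies, lam)))

def solve_multiplier(energies, c, lo=-50.0, hi=50.0, iterations=200):
    """Bisection for the multiplier with mean energy c; the mean decreases as lam grows."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > c:
            lo = mid  # mean still too large: the multiplier must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
lam = solve_multiplier(energies, 1.0)
weights = boltzmann_weights(energies, lam)
print(lam, weights)
```

Since the uniform distribution on these levels has mean 1.5, prescribing the smaller mean 1 forces a positive multiplier, which corresponds to a positive temperature in the physical reading $\lambda_1^{-1} = kT$.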



Here is a dictionary matching some notions in probability theory with corresponding terms in statistical physics. The statistical physics jargon is often more intuitive.

Probability theory              Statistical mechanics
Set $\Omega$                    Phase space
Measure space                   Thermodynamic system
Random variable                 Observable (for example energy)
Probability density             Thermodynamic state
Entropy                         Boltzmann-Gibbs entropy
Densities of maximal entropy    Thermodynamic equilibria
Central limit theorem           Maximal entropy principle



Distributions which maximize the entropy, possibly under some constraint, are mathematically natural because they are critical points of a variational principle. Physically, they are natural because nature prefers them. From the statistical mechanical point of view, the extremal properties of entropy offer insight into thermodynamics, where large systems are modeled with statistical methods. Thermodynamic equilibria often extremize variational problems in a given set of measures.

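Definition. Given a measure space $(\Omega, \mathcal{A})$ with a not necessarily finite measure $\nu$ and a random variable $X \in \mathcal{L}$. Given $f \in \mathcal{L}^1$ leading to the probability measure $\mu = f\nu$. Consider the moment generating function $Z(\lambda) = E_\mu[e^{\lambda X}]$ and define the interval $\Lambda = \{\lambda \in \mathbb{R} \mid Z(\lambda) < \infty\}$ in $\mathbb{R}$. For every $\lambda \in \Lambda$ we can define a new probability measure

\[ \mu_\lambda = f_\lambda \nu = \frac{e^{\lambda X}}{Z(\lambda)} \mu \]

on $\Omega$. The set

\[ \{\mu_\lambda \mid \lambda \in \Lambda\} \]

of measures on $(\Omega, \mathcal{A})$ is called the exponential family defined by $\nu$ and $X$.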



Theorem 2.15.3 (Minimizing relative entropy). For all probability measures
$\tilde\mu$ which are absolutely continuous with respect to $\nu$, we have for all $\lambda \in \Lambda$

$$ H(\tilde\mu|\mu) - \lambda E_{\tilde\mu}[X] \geq -\log Z(\lambda) . $$

The minimum $-\log Z(\lambda)$ is obtained for $\mu_\lambda$.



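Proof. For every $\tilde\mu = \tilde f \nu$, we have

$$ H(\tilde\mu|\mu) = \int_\Omega \tilde f \log\left( \frac{\tilde f}{f_\lambda} \cdot \frac{f_\lambda}{f} \right) d\nu
   = H(\tilde\mu|\mu_\lambda) + ( -\log(Z(\lambda)) + \lambda E_{\tilde\mu}[X] ) . $$

For $\tilde\mu = \mu_\lambda$, we have

$$ H(\mu_\lambda|\mu) = -\log(Z(\lambda)) + \lambda E_{\mu_\lambda}[X] . $$

Therefore

$$ H(\tilde\mu|\mu) - \lambda E_{\tilde\mu}[X] = H(\tilde\mu|\mu_\lambda) - \log(Z(\lambda)) \geq -\log Z(\lambda) . $$

The minimum is obtained for $\tilde\mu = \mu_\lambda$. □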
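
The inequality of Theorem 2.15.3 can be checked numerically on a finite set. The following is only a sketch under assumptions that are not in the text: $\Omega = \{0,\dots,4\}$, $\mu$ uniform, X(k) = k and $\lambda = 0.7$ are illustrative choices, and the names `rel_entropy`, `functional`, `min_gap` are hypothetical.

```python
import math
import random

def rel_entropy(p, q):
    """H(p|q) = sum_k p_k log(p_k / q_k), with the convention 0 log 0 = 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Assumed toy setting: Omega = {0,...,4}, mu uniform, X(k) = k, lam = 0.7.
X = [0, 1, 2, 3, 4]
mu = [0.2] * 5
lam = 0.7

# Partition function Z(lam) = E_mu[exp(lam X)] and the tilted measure mu_lam.
Z = sum(m * math.exp(lam * x) for m, x in zip(mu, X))
mu_lam = [m * math.exp(lam * x) / Z for m, x in zip(mu, X)]

def functional(p):
    """H(p|mu) - lam E_p[X], the quantity bounded below by -log Z(lam)."""
    return rel_entropy(p, mu) - lam * sum(pk * x for pk, x in zip(p, X))

# Sample random probability vectors and record functional(p) + log Z >= 0.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    w = [rng.random() for _ in X]
    total = sum(w)
    gaps.append(functional([wk / total for wk in w]) + math.log(Z))

min_gap = min(gaps)                                    # expected to be >= 0
gap_at_minimizer = functional(mu_lam) + math.log(Z)    # expected to be ~ 0
```

The gap is exactly the relative entropy $H(\tilde\mu|\mu_\lambda)$ from the proof, which is why it vanishes only at the tilted measure.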



Corollary 2.15.4. (Minimizers for relative entropy)

a) $\mu_\lambda$ minimizes the relative entropy $\tilde\mu \mapsto H(\tilde\mu|\mu)$ among all
$\nu$-absolutely continuous measures $\tilde\mu$ with fixed $E_{\tilde\mu}[X]$.

b) If we fix $\lambda$ by requiring $E_{\mu_\lambda}[X] = c$, then $\mu_\lambda$ maximizes the entropy
$H(\tilde\mu)$ among all measures $\tilde\mu$ satisfying $E_{\tilde\mu}[X] = c$.

Proof. a) Minimizing $\tilde\mu \mapsto H(\tilde\mu|\mu)$ under the constraint $E_{\tilde\mu}[X] = c$ is
equivalent to minimizing

$$ H(\tilde\mu|\mu) - \lambda E_{\tilde\mu}[X] , $$

and determining the Lagrange multiplier $\lambda$ by $E_{\mu_\lambda}[X] = c$. The above
theorem shows that $\mu_\lambda$ minimizes this expression.
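b) If $\tilde\mu = \tilde f \nu$ and $\mu_\lambda = e^{-\lambda X} f/Z$, then

$$ 0 \leq H(\tilde\mu|\mu_\lambda) = -H(\tilde\mu) + (-\log(Z)) - \lambda E_{\tilde\mu}[X] = -H(\tilde\mu) + H(\mu_\lambda) . $$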

□ 



Corollary 2.15.5. If $\nu = \mu$ is a probability measure, then $\mu_\lambda$ maximizes

$$ F(\tilde\mu) = H(\tilde\mu) + \lambda E_{\tilde\mu}[X] $$

among all measures $\tilde\mu$ which are absolutely continuous with respect to $\mu$.



Proof. Take $\mu = \nu$. Since then f = 1, $H(\tilde\mu|\mu) = -H(\tilde\mu)$. The claim follows
from the theorem since a minimum of $H(\tilde\mu|\mu) - \lambda E_{\tilde\mu}[X]$ corresponds to a
maximum of $F(\tilde\mu)$. □

This corollary can also be proved by calculus of variations, namely by
finding the minimum of $F(f) = \int f \log(f) + X f \, d\nu$ under the constraint
$\int f \, d\nu = 1$.



Remark. In statistical mechanics, the measure $\mu_\lambda$ is called the Gibbs
distribution or Gibbs canonical ensemble for the observable X and $Z(\lambda)$ is called
the partition function. In physics, one uses the notation $\lambda = -(kT)^{-1}$,
where T is the temperature. Maximizing $H(\mu) - (kT)^{-1} E_\mu[X]$ is the same
as minimizing $E_\mu[X] - kT H(\mu)$ which is called the free energy if X is
the Hamiltonian and $E_\mu[X]$ is the energy. The measure $\mu$ is the a priori
model, the micro canonical ensemble. Adding the restriction that X has
a specific expectation value $c = E_\mu[X]$ leads to the probability measure
$\mu_\lambda$, the canonical ensemble. We illustrated two physical principles: nature
maximizes entropy when the energy is fixed and minimizes the free energy
when the energy is not fixed.

Example. Take on the real line the Hamiltonian $X(x) = x^2$ and a measure
$\mu = f\,dx$. We get the energy $\int x^2 \, d\mu$. Among all symmetric distributions
fixing the energy, the Gaussian distribution maximizes the entropy.

Example. Let $\Omega = \mathbb{N} = \{0, 1, 2, \ldots\}$ and X(k) = k and let $\nu$ be the
counting measure on $\Omega$ and $\mu$ the Poisson measure with parameter 1. The
partition function is

$$ Z(\lambda) = \sum_k e^{\lambda k} \frac{e^{-1}}{k!} = \exp(e^{\lambda} - 1) $$

so that $\Lambda = \mathbb{R}$ and $\mu_\lambda$ is given by the weights

$$ \mu_\lambda(k) = \exp(-e^{\lambda} + 1) \, e^{\lambda k} \frac{e^{-1}}{k!} = e^{-\alpha} \frac{\alpha^k}{k!} , $$

where $\alpha = e^{\lambda}$. The exponential family of the Poisson measure is the family
of all Poisson measures.
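
As a quick sanity check of this example, the sketch below compares the truncated series for $Z(\lambda)$ with the closed form and the tilted weights with the Poisson weights of parameter $\alpha = e^\lambda$. The value $\lambda = 0.5$ and the truncation level `K` are assumptions chosen for illustration only.

```python
import math

def poisson_pmf(alpha, k):
    """Poisson weight e^{-alpha} alpha^k / k!."""
    return math.exp(-alpha) * alpha**k / math.factorial(k)

lam = 0.5   # assumed value of the parameter lambda
K = 60      # truncation of the series; the neglected tail is negligible here

# Partition function: truncated series versus closed form exp(e^lam - 1).
Z_numeric = sum(math.exp(lam * k) * poisson_pmf(1.0, k) for k in range(K))
Z_closed = math.exp(math.exp(lam) - 1.0)

# Tilted weights e^{lam k} mu(k) / Z versus Poisson weights with alpha = e^lam.
alpha = math.exp(lam)
max_err = max(
    abs(math.exp(lam * k) * poisson_pmf(1.0, k) / Z_closed - poisson_pmf(alpha, k))
    for k in range(K)
)
```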

Example. The geometric distribution on N = {0,1,2,3,... } is an expo- 
nential family. 

Example. The product measure on $\Omega = \{0, 1\}^N$ with win probability p is
an exponential family with respect to X(k) = k.

Example. $\Omega = \{1, \ldots, N\}$, $\nu$ the counting measure and let $\mu_p$ be the
binomial distribution with p. Take $\mu = \mu_{1/2}$ and X(k) = k. Since

$$ 0 \leq H(\tilde\mu|\mu) = H(\tilde\mu|\mu_p) + \log(p) E[X] + \log(1-p) E[(N - E[X])]
   = -H(\tilde\mu|\mu_p) + H(\mu_p) , $$

$\mu_p$ is an exponential family.

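Remark. There is an obvious generalization of the maximum entropy
principle to the case when we have finitely many random variables $\{X_i\}_{i=1}^n$.
Given $\mu = f\nu$ we define the (n-dimensional) exponential family

$$ \mu_\lambda = f_\lambda \nu = \frac{e^{\sum_{i=1}^n \lambda_i X_i}}{Z(\lambda)} \mu , $$

where

$$ Z(\lambda) = E_\mu[ e^{\sum_{i=1}^n \lambda_i X_i} ] $$

is the partition function defined on a subset $\Lambda$ of $\mathbb{R}^n$.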



Theorem 2.15.6. For all probability measures $\tilde\mu$ which are absolutely
continuous with respect to $\nu$, we have for all $\lambda \in \Lambda$

$$ H(\tilde\mu|\mu) - \sum_i \lambda_i E_{\tilde\mu}[X_i] \geq -\log Z(\lambda) . $$

The minimum $-\log Z(\lambda)$ is obtained for $\mu_\lambda$. If we fix $\lambda_i$ by requiring
$E_{\mu_\lambda}[X_i] = c_i$, then $\mu_\lambda$ maximizes the entropy $H(\tilde\mu)$ among all measures
$\tilde\mu$ satisfying $E_{\tilde\mu}[X_i] = c_i$.

Assume $\nu = \mu$ is a probability measure. The measure $\mu_\lambda$ maximizes

$$ F(\tilde\mu) = H(\tilde\mu) + \lambda E_{\tilde\mu}[X] . $$

Proof. Take the same proofs as before by replacing $\lambda X$ with
$\lambda \cdot X = \sum_i \lambda_i X_i$. □

2.16 Markov operators 

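Definition. Given a not necessarily finite probability space $(\Omega, \mathcal{A}, \nu)$. A
linear operator $P : \mathcal{L}^1(\Omega) \to \mathcal{L}^1(\Omega)$ is called a Markov operator, if

$$ P1 = 1, \qquad f \geq 0 \Rightarrow Pf \geq 0, \qquad f \geq 0 \Rightarrow \|Pf\|_1 = \|f\|_1 . $$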



Remark. In other words, a Markov operator P has to leave the closed
positive cone $\mathcal{L}^1_+ = \{ f \in \mathcal{L}^1 \mid f \geq 0 \}$ invariant and preserve the norm on
that cone.

Remark. A Markov operator on $(\Omega, \mathcal{A}, \nu)$ leaves invariant the set
$\mathcal{D}(\nu) = \{ f \in \mathcal{L}^1 \mid f \geq 0, \|f\|_1 = 1 \}$ of probability densities. They correspond
bijectively to the set $\mathcal{P}(\nu)$ of probability measures which are absolutely
continuous with respect to $\nu$. A Markov operator is therefore also called a
stochastic operator.



Example. Let T be a measure preserving transformation on $(\Omega, \mathcal{A}, \nu)$. It is
called nonsingular if $T^*\nu$ is absolutely continuous with respect to $\nu$. The
unique operator $P : \mathcal{L}^1 \to \mathcal{L}^1$ satisfying

$$ \int_A Pf \, d\nu = \int_{T^{-1}A} f \, d\nu $$

is called the Perron-Frobenius operator associated to T. It is a Markov
operator. Closely related is the operator Pf(x) = f(Tx) for measure
preserving invertible transformations. This Koopman operator is often studied
on $\mathcal{L}^2$, but it becomes a Markov operator when considered as a
transformation on $\mathcal{L}^1$.



Exercise. Assume $\Omega = [0, 1]$ with Lebesgue measure $\mu$. Verify that the
Perron-Frobenius operator for the tent map

$$ T(x) = \begin{cases} 2x , & x \in [0, 1/2] \\ 2(1-x) , & x \in [1/2, 1] \end{cases} $$

is $Pf(x) = \frac{1}{2} ( f(\frac{1}{2}x) + f(1 - \frac{1}{2}x) )$.
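
Before proving the formula, one can test the defining identity $\int_A Pf \, d\nu = \int_{T^{-1}A} f \, d\nu$ numerically. The sketch below does this with a midpoint rule; the test density $f(x) = 3x^2$ and the set A = [0.2, 0.7] are assumptions made for illustration, and a numerical agreement is of course not a proof.

```python
def tent(x):
    """The tent map T on [0, 1]."""
    return 2 * x if x <= 0.5 else 2 * (1 - x)

def perron_frobenius(f):
    """Candidate operator from the exercise: Pf(x) = (f(x/2) + f(1 - x/2)) / 2."""
    return lambda x: 0.5 * (f(x / 2) + f(1 - x / 2))

def integrate(g, n=20000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))

f = lambda x: 3 * x * x                      # assumed test density on [0, 1]
a, b = 0.2, 0.7                              # assumed test set A = [a, b]
in_A = lambda x: 1.0 if a <= x <= b else 0.0

Pf = perron_frobenius(f)
lhs = integrate(lambda x: Pf(x) * in_A(x))        # integral of Pf over A
rhs = integrate(lambda x: f(x) * in_A(tent(x)))   # integral of f over T^{-1}(A)
```

Here $T^{-1}A = [0.1, 0.35] \cup [0.65, 0.9]$, so both sides should be close to $0.35^3 - 0.1^3 + 0.9^3 - 0.65^3 = 0.49625$.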



Here is an abstract version of the Jensen inequality (2.5.1). It is due to M.
Kuczma. See [63].

Theorem 2.16.1 (Jensen inequality for positive operators). Given a convex
function u and an operator $P : \mathcal{L}^1 \to \mathcal{L}^1$ mapping positive functions into
positive functions satisfying P1 = 1, then

$$ u(Pf) \leq Pu(f) $$

for all $f \in \mathcal{L}^1_+$ for which Pu(f) exists.



Proof. We have to show $u(Pf)(\omega) \leq Pu(f)(\omega)$ for almost all $\omega \in \Omega$. Given
$x = (Pf)(\omega)$, there exists by definition of convexity a linear function
$y \mapsto ay + b$ such that u(x) = ax + b and $u(y) \geq ay + b$ for all $y \in \mathbb{R}$. Therefore,
since $af + b \leq u(f)$ and P is positive,

$$ u(Pf)(\omega) = a(Pf)(\omega) + b = P(af + b)(\omega) \leq P(u(f))(\omega) . $$

□

The following theorem states that relative entropy does not increase along
orbits of Markov operators. The assumption that $\{f > 0\}$ is mapped into
itself is actually not necessary, but simplifies the proof.



Theorem 2.16.2 (Voigt, 1981). Given a Markov operator P which maps
$\{f > 0\}$ into itself. For all $f, g \in \mathcal{L}^1_+$,

$$ H(Pf|Pg) \leq H(f|g) . $$

Especially, since H(f|1) = -H(f) is the entropy, a Markov operator does
not decrease entropy:

$$ H(Pf) \geq H(f) . $$



Proof. We can assume that $\{g(\omega) = 0\} \subset A = \{f(\omega) = 0\}$ because
nothing is to show in the case $H(f|g) = \infty$. By restriction to the measure
space $(A^c, \mathcal{A} \cap A^c, \nu(\cdot \cap A^c))$, we can assume f > 0, g > 0 so that by our
assumption also Pf > 0 and Pg > 0.

(i) Assume first $(f/g)(\omega) \leq c$ for some constant $c \in \mathbb{R}$.
For fixed g, the linear operator Rh = P(hg)/P(g) maps positive functions
into positive functions. Take the convex function $u(x) = x \log(x)$ and put
h = f/g. Using Jensen's inequality, we get

$$ \frac{Pf}{Pg} \log \frac{Pf}{Pg} = u(Rh) \leq Ru(h) = \frac{P(f \log(f/g))}{Pg} $$

which is equivalent to $Pf \log \frac{Pf}{Pg} \leq P(f \log(f/g))$. Integration gives

$$ H(Pf|Pg) = \int Pf \log \frac{Pf}{Pg} \, d\nu
   \leq \int P(f \log(f/g)) \, d\nu = \int f \log(f/g) \, d\nu = H(f|g) . $$

(ii) Define $f_k = \inf(f, kg)$ so that $f_k/g \leq k$. We have $f_k \leq f_{k+1}$ and
$f_k \to f$ in $\mathcal{L}^1$. From (i) we know that $H(Pf_k|Pg) \leq H(f_k|g)$. We can
assume $H(f|g) < \infty$ because the result is trivially true in the other case.
Define $B = \{f \leq g\}$. On B, we have $f_k \log(f_k/g) = f \log(f/g)$ and on
$\Omega \setminus B$ we have

$$ f_k \log(f_k/g) \leq f_{k+1} \log(f_{k+1}/g) \to f \log(f/g) $$

so that by Lebesgue dominated convergence theorem (2.4.3),

$$ H(f|g) = \lim_{k \to \infty} H(f_k|g) . $$

As an increasing sequence, $Pf_k$ converges to Pf almost everywhere. The
elementary inequality $x \log(x) - x \geq x \log(y) - y$ for all $x \geq y \geq 0$ gives

$$ (Pf_k) \log(Pf_k) - (Pf_k) \log(Pg) - (Pf_k) + (Pg) \geq 0 . $$

Integration gives with Fatou's lemma (2.4.2)

$$ H(Pf|Pg) - \|Pf\| + \|Pg\| \leq \liminf_{k \to \infty} H(Pf_k|Pg) - \|Pf_k\| + \|Pg\| $$

and so $H(Pf|Pg) \leq \liminf_{k \to \infty} H(Pf_k|Pg)$. □
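
A finite-dimensional illustration of Voigt's theorem: on a finite set with counting measure, a doubly stochastic matrix satisfies P1 = 1 (row sums 1), is positive, and preserves the $\ell^1$ norm of positive vectors (column sums 1), so it is a Markov operator in the sense above. The sketch below is only a numerical check under that assumption; the specific matrix `M` and the helper names are made up for illustration.

```python
import math
import random

def apply(M, f):
    """Matrix-vector product (Pf)_i = sum_j M[i][j] f_j."""
    return [sum(M[i][j] * f[j] for j in range(len(f))) for i in range(len(M))]

def rel_entropy(f, g):
    """H(f|g) = sum_i f_i log(f_i / g_i) for strictly positive vectors."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

def entropy(f):
    """H(f) = -sum_i f_i log(f_i)."""
    return -sum(fi * math.log(fi) for fi in f)

# Assumed example: a convex combination of three permutation matrices,
# which is doubly stochastic.
n = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
shift = [[1.0 if j == (i + 1) % n else 0.0 for j in range(n)] for i in range(n)]
flip = [[1.0 if j == n - 1 - i else 0.0 for j in range(n)] for i in range(n)]
M = [[0.5 * identity[i][j] + 0.3 * shift[i][j] + 0.2 * flip[i][j]
      for j in range(n)] for i in range(n)]

rng = random.Random(1)

def random_density():
    """A random strictly positive probability vector."""
    w = [rng.random() + 0.01 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

violations = 0
for _ in range(500):
    f, g = random_density(), random_density()
    Pf, Pg = apply(M, f), apply(M, g)
    if rel_entropy(Pf, Pg) > rel_entropy(f, g) + 1e-12:
        violations += 1   # would contradict H(Pf|Pg) <= H(f|g)
    if entropy(Pf) < entropy(f) - 1e-12:
        violations += 1   # would contradict H(Pf) >= H(f)
```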



Corollary 2.16.3. For an invertible Markov operator P, the relative entropy
is constant: H(Pf|Pg) = H(f|g).



Proof. Because P and $P^{-1}$ are both Markov operators,

$$ H(f|g) = H(PP^{-1}f|PP^{-1}g) \leq H(P^{-1}f|P^{-1}g) \leq H(f|g) . $$

□

Example. If a measure preserving transformation T is invertible, then the
corresponding Koopman and Perron-Frobenius operators preserve
relative entropy.



Corollary 2.16.4. The operator $T(\mu)(A) = \int_{\mathbb{R}^2} 1_A\left( \frac{x+y}{\sqrt{2}} \right) d\mu(x) \, d\mu(y)$ does
not decrease entropy.

Proof. Denote by $X_\mu$ a random variable having the law $\mu$ and with $\mu(X)$
the law of a random variable X. For a fixed random variable Y, we define the
operator

$$ P_Y(\mu) = \mu\left( \frac{X_\mu + Y}{\sqrt{2}} \right) . $$

It is a Markov operator. By Voigt's theorem (2.16.2), the operator $P_Y$ does
not decrease entropy. Since every $P_Y$ has this property, also the nonlinear
map $T(\mu) = P_{X_\mu}(\mu)$ shares this property. □

We have shown as a corollary of the central limit theorem that T has a
unique fixed point attracting all of $\mathcal{P}_{0,1}$. The entropy is also strictly
increasing at infinitely many points of the orbit $T^n(\mu)$ since it converges to
the fixed point with maximal entropy. It follows that T is not invertible.



More generally: given a sequence $X_n$ of IID random variables. For every n,
the map $P_n$ which maps the law of $S_n^*$ into the law of $S_{n+1}^*$ is a Markov
operator which does not decrease entropy. We can summarize: summing up
IID random variables tends to increase the entropy of the distributions.
A fixed point of a Markov operator is called a stationary state or in more
physical language a thermodynamic equilibrium. Important questions are:
is there a thermodynamic equilibrium for a given Markov operator P and
if yes, how many are there?



2.17 Characteristic functions 

Distribution functions are in general not so easy to deal with, as for
example, when summing up independent random variables. It is therefore
convenient to deal with their Fourier transforms, the characteristic functions.
It is an important topic by itself [62].

Definition. Given a random variable X, its characteristic function is a
complex-valued function on $\mathbb{R}$ defined as

$$ \phi_X(u) = E[e^{iuX}] . $$

If $F_X$ is the distribution function of X and $\mu_X$ its law, the characteristic
function of X is the Fourier-Stieltjes transform

$$ \phi_X(t) = \int_{\mathbb{R}} e^{itx} \, dF_X(x) = \int_{\mathbb{R}} e^{itx} \, \mu_X(dx) . $$

Remark. If $F_X$ is a continuous distribution function $dF_X(x) = f_X(x) \, dx$,
then $\phi_X$ is the Fourier transform of the density function $f_X$:

$$ \int_{\mathbb{R}} e^{itx} f_X(x) \, dx . $$


Remark. By definition, characteristic functions are Fourier transforms of
probability measures: if $\mu$ is the law of X, then $\phi_X = \hat{\mu}$.

Example. For a random variable with density $f_X(x) = x^m/(m + 1)$ on
$\Omega = [0, 1]$ the characteristic function is

$$ \phi_X(t) = \int_0^1 e^{itx} x^m \, dx/(m + 1) = \frac{m! \, (1 - e^{it} e_m(-it))}{(-it)^{1+m} (m + 1)} , $$

where $e_n(x) = \sum_{k=0}^n x^k/(k!)$ is the n'th partial exponential function.
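
The closed form of the integral $\int_0^1 e^{itx} x^m \, dx = m! \, (1 - e^{it} e_m(-it)) / (-it)^{m+1}$ used in this example can be compared with a direct quadrature. This is a minimal sketch; the values m = 3, t = 2.5 and the function names are assumptions for illustration, and the common factor 1/(m + 1) is left out on both sides.

```python
import cmath
import math

def partial_exp(n, z):
    """n-th partial exponential function e_n(z) = sum_{k=0}^n z^k / k!."""
    return sum(z**k / math.factorial(k) for k in range(n + 1))

def closed_form(m, t):
    """m! (1 - e^{it} e_m(-it)) / (-it)^{m+1}."""
    s = -1j * t
    return math.factorial(m) * (1 - cmath.exp(1j * t) * partial_exp(m, s)) / s**(m + 1)

def quadrature(m, t, n=20000):
    """Midpoint rule for the integral of e^{itx} x^m over [0, 1]."""
    h = 1.0 / n
    return h * sum(cmath.exp(1j * t * (i + 0.5) * h) * ((i + 0.5) * h)**m
                   for i in range(n))

m, t = 3, 2.5   # assumed test values
err = abs(closed_form(m, t) - quadrature(m, t))
```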



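Theorem 2.17.1 (Levy formula). The characteristic function $\phi_X$ determines
the distribution of X. If a, b are points of continuity of F, then

$$ F_X(b) - F_X(a) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-ita} - e^{-itb}}{it} \phi_X(t) \, dt . \qquad (2.12) $$

In general, one has

$$ \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-ita} - e^{-itb}}{it} \phi_X(t) \, dt = \mu[(a, b)] + \frac{1}{2} \mu[\{a\}] + \frac{1}{2} \mu[\{b\}] . $$
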



Proof. Because a distribution function F has only countably many points of
discontinuities, it is enough to determine F(b) - F(a) in terms of $\phi$ if a and
b are continuity points of F. The verification of the Levy formula is then
a computation. For continuous distributions with density $F_X' = f_X$, it is the
inverse formula for the Fourier transform:
$f_X(a) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ita} \phi_X(t) \, dt$
so that $F_X(a) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-ita}}{-it} \phi_X(t) \, dt$.
This proves the inversion formula if a and b are points of continuity.

The general formula needs only to be verified when $\mu$ is a point measure
at the boundary of the interval. By linearity, one can assume $\mu$ is located
on a single point b with p = P[X = b] > 0. The Fourier transform of the
Dirac measure $p\delta_b$ is $\phi_X(t) = p e^{itb}$. The claim reduces to

$$ \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-ita} - e^{-itb}}{it} \, p e^{itb} \, dt = \frac{p}{2} $$

which is equivalent to the claim $\lim_{R \to \infty} \int_{-R}^{R} \frac{e^{itc} - 1}{it} \, dt = \pi$ for c > 0.
Because the imaginary part is zero for every R by symmetry, only

$$ \lim_{R \to \infty} \int_{-R}^{R} \frac{\sin(tc)}{t} \, dt = \pi $$

remains. The verification of this integral is a prototype computation in
residue calculus. □
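
The inversion formula (2.12) can be tried out on the standard normal distribution, whose characteristic function $e^{-t^2/2}$ decays fast enough that a truncated midpoint rule suffices. The sketch below is an illustration under assumptions: the cut-off `T`, the number of nodes `n` and the interval (a, b) = (-1, 2) are arbitrary choices.

```python
import cmath
import math

def phi(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-t * t / 2)

def levy(a, b, T=12.0, n=48000):
    """Midpoint rule for (1/2pi) * integral over [-T, T] of
    (e^{-ita} - e^{-itb}) / (it) * phi(t) dt."""
    h = 2 * T / n
    total = 0.0
    for i in range(n):
        t = -T + (i + 0.5) * h   # midpoints never hit the removable point t = 0
        value = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t) * phi(t)
        total += value.real      # the imaginary parts cancel by symmetry
    return total * h / (2 * math.pi)

def normal_cdf(x):
    """Distribution function of N(0, 1) via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b = -1.0, 2.0
approx = levy(a, b)
exact = normal_cdf(b) - normal_cdf(a)
```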






Theorem 2.17.2 (Characterization of weak convergence). A sequence $X_n$
of random variables converges weakly to X if and only if its characteristic
functions converge pointwise:

$$ \phi_{X_n}(x) \to \phi_X(x) . $$



Proof. Because the exponential function $e^{itx}$ is continuous for each t, it
follows from the definition that weak convergence implies the pointwise
convergence of the characteristic functions. From formula (2.12) follows
that if the characteristic functions converge pointwise, then convergence
in distribution takes place. We have learned in lemma (2.13.2) that weak
convergence is equivalent to convergence in distribution. □



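Example. Here is a table of characteristic functions (CF) $\phi_X(t) = E[e^{itX}]$
and moment generating functions (MGF) $M_X(t) = E[e^{tX}]$ for some familiar
random variables:

Distribution    Parameter                            CF                               MGF
Normal          $m \in \mathbb{R}, \sigma^2 > 0$     $e^{imt - \sigma^2 t^2/2}$       $e^{mt + \sigma^2 t^2/2}$
N(0,1)                                               $e^{-t^2/2}$                     $e^{t^2/2}$
Uniform         $[-a, a]$                            $\sin(at)/(at)$                  $\sinh(at)/(at)$
Exponential     $\lambda > 0$                        $\lambda/(\lambda - it)$         $\lambda/(\lambda - t)$
Binomial        $n \geq 1, p \in [0, 1]$             $(1 - p + pe^{it})^n$            $(1 - p + pe^{t})^n$
Poisson         $\lambda > 0$                        $e^{\lambda(e^{it} - 1)}$        $e^{\lambda(e^{t} - 1)}$
Geometric       $p \in (0, 1)$                       $p/(1 - (1-p)e^{it})$            $p/(1 - (1-p)e^{t})$
First success   $p \in (0, 1)$                       $pe^{it}/(1 - (1-p)e^{it})$      $pe^{t}/(1 - (1-p)e^{t})$
Cauchy          $m \in \mathbb{R}, b > 0$            $e^{imt - b|t|}$                 (does not exist)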
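
The entries for the discrete distributions in this table can be checked by summing $e^{itk}$ against the probability weights. The following sketch does this for the binomial, Poisson and geometric rows; the parameter values and the truncation of the infinite sums are assumptions chosen for illustration, and the geometric distribution is taken on {0, 1, 2, ...} as earlier in the text.

```python
import cmath
import math

def cf_from_pmf(pmf, support, t):
    """E[e^{itX}] = sum over the support of pmf(k) e^{itk}."""
    return sum(pmf(k) * cmath.exp(1j * t * k) for k in support)

t, n, p, lam = 0.8, 7, 0.3, 2.0   # assumed test values

# Binomial(n, p): (1 - p + p e^{it})^n
binom_pmf = lambda k: math.comb(n, k) * p**k * (1 - p)**(n - k)
binom_err = abs(cf_from_pmf(binom_pmf, range(n + 1), t)
                - (1 - p + p * cmath.exp(1j * t))**n)

# Poisson(lam): exp(lam (e^{it} - 1)), series truncated at 80 terms
pois_pmf = lambda k: math.exp(-lam) * lam**k / math.factorial(k)
pois_err = abs(cf_from_pmf(pois_pmf, range(80), t)
               - cmath.exp(lam * (cmath.exp(1j * t) - 1)))

# Geometric(p) on {0, 1, 2, ...}: p / (1 - (1 - p) e^{it}), truncated at 400 terms
geom_pmf = lambda k: p * (1 - p)**k
geom_err = abs(cf_from_pmf(geom_pmf, range(400), t)
               - p / (1 - (1 - p) * cmath.exp(1j * t)))
```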



Definition. Let F and G be two probability distribution functions. Their
convolution $F \star G$ is defined as

$$ F \star G(x) = \int_{\mathbb{R}} F(x - y) \, dG(y) . $$



Lemma 2.17.3. If F and G are distribution functions, then $F \star G$ is again
a distribution function.



Proof. We have to verify the three properties which characterize
distribution functions among real-valued functions as in proposition (2.12.1).
a) Since F is nondecreasing, also $F \star G$ is nondecreasing.
b) Because $F(-\infty) = 0$ we have also $F \star G(-\infty) = 0$. Since $F(\infty) = 1$ and
dG is a probability measure, also $F \star G(\infty) = 1$.

c) Given a sequence $h_n \to 0$. Define $F_n(x) = F(x + h_n)$. Because F
is continuous from the right, $F_n(x)$ converges pointwise to F(x). The
Lebesgue dominated convergence theorem (2.4.3) implies that
$F_n \star G(x) = F \star G(x + h_n)$ converges to $F \star G(x)$. □

Example. Given two discrete distributions

$$ F(x) = \sum_{n \leq x} p_n , \qquad G(x) = \sum_{n \leq x} q_n . $$

Then $F \star G(x) = \sum_{n \leq x} (p \star q)_n$, where $p \star q$ is the convolution of the sequences
p, q defined by $(p \star q)_n = \sum_{k=0}^n p_k q_{n-k}$. We see that the convolution of
discrete distributions gives again a discrete distribution.

Example. Given two continuous distributions F, G with densities h and k.
Then the density of $F \star G$ is given by the convolution

$$ h \star k(x) = \int_{\mathbb{R}} h(x - y) k(y) \, dy . $$


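Lemma 2.17.4. If F and G are distribution functions with characteristic
functions $\phi$ and $\psi$, then $F \star G$ has the characteristic function $\phi \cdot \psi$.



Proof. While one can deduce this fact directly from Fourier theory, we
prove it by hand: use an approximation of the integral by step functions:

$$ \int_{\mathbb{R}} e^{iux} \, d(F \star G)(x)
   = \lim_{N, n \to \infty} \sum_{k=-N2^n+1}^{N2^n} e^{iuk2^{-n}} \int_{\mathbb{R}} \left[ F\left(\frac{k}{2^n} - y\right) - F\left(\frac{k-1}{2^n} - y\right) \right] dG(y) $$
$$ = \lim_{N, n \to \infty} \sum_{k=-N2^n+1}^{N2^n} \int_{\mathbb{R}} e^{iu(k2^{-n} - y)} \left[ F\left(\frac{k}{2^n} - y\right) - F\left(\frac{k-1}{2^n} - y\right) \right] \cdot e^{iuy} \, dG(y) $$
$$ = \int_{\mathbb{R}} \left[ \int_{\mathbb{R}} e^{iu(x - y)} \, dF(x - y) \right] \cdot e^{iuy} \, dG(y)
   = \int_{\mathbb{R}} \phi(u) e^{iuy} \, dG(y) = \phi(u) \psi(u) . $$

□

It follows that the set of distribution functions forms an associative
commutative semigroup with respect to the convolution multiplication. The reason
is that the characteristic functions have this property with pointwise
multiplication.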
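
For discrete distributions on {0, 1, 2, ...}, the statement of Lemma 2.17.4 can be verified directly: the characteristic function of the convolved weights equals the product of the characteristic functions. The weights `p`, `q` and the point `u` in this sketch are assumptions chosen for illustration.

```python
import cmath

def convolve(p, q):
    """(p*q)_n = sum_k p_k q_{n-k} for finitely supported sequences."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def cf(p, u):
    """Characteristic function sum_k p_k e^{iuk} of the weights p."""
    return sum(pk * cmath.exp(1j * u * k) for k, pk in enumerate(p))

p = [0.2, 0.5, 0.3]          # assumed weights on {0, 1, 2}
q = [0.1, 0.0, 0.4, 0.5]     # assumed weights on {0, 1, 2, 3}
r = convolve(p, q)           # weights of the convolution on {0, ..., 5}

u = 1.3
err = abs(cf(r, u) - cf(p, u) * cf(q, u))
```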

Characteristic functions become especially useful, if one deals with inde- 
pendent random variables. Their characteristic functions multiply: 



Proposition 2.17.5. Given a finite set of independent random variables
$X_j$, j = 1, ..., n with characteristic functions $\phi_j$. The characteristic
function of $\sum_{j=1}^n X_j$ is $\phi = \prod_{j=1}^n \phi_j$.



Proof. Since the $X_j$ are independent, we get for any set of complex valued
measurable functions $g_j$, for which $E[g_j(X_j)]$ exists:

$$ E\left[ \prod_{j=1}^n g_j(X_j) \right] = \prod_{j=1}^n E[g_j(X_j)] . $$

This follows almost immediately from the definition of independence
since one can check it first for functions $g_j = 1_{A_j}$, where the $A_j$ are
$\sigma(X_j)$ measurable sets, for which $g_j(X_j) g_k(X_k) = 1_{A_j \cap A_k}$ and

$$ E[g_j(X_j) g_k(X_k)] = \mu(A_j) \mu(A_k) = E[g_j(X_j)] E[g_k(X_k)] , $$

then for step functions by linearity and then for arbitrary measurable
functions.

If we put $g_j(x) = \exp(itx)$, the proposition is proved. □

Example. If $X_n$ are IID random variables which take the values 0 and 2 with
probability 1/2 each, the random variable $X = \sum_{n=1}^{\infty} X_n/3^n$ is a random
variable with the Cantor distribution. Because the characteristic function
of $X_n/3^n$ is $\phi_{X_n/3^n}(t) = E[e^{itX_n/3^n}] = \frac{e^{i2t/3^n} + 1}{2}$, we see that the characteristic
function of X is

$$ \phi_X(t) = \prod_{n=1}^{\infty} \frac{e^{i2t/3^n} + 1}{2} . $$

The centered random variable Y = X - 1/2 can be written as
$Y = \sum_{n=1}^{\infty} Y_n/3^n$, where $Y_n$ takes values -1, 1 with probability 1/2. So

$$ \phi_Y(t) = \prod_n E[e^{itY_n/3^n}] = \prod_{n=1}^{\infty} \frac{e^{it/3^n} + e^{-it/3^n}}{2} = \prod_{n=1}^{\infty} \cos\left( \frac{t}{3^n} \right) . $$

This formula for the Fourier transform of a singular continuous measure $\mu$
has already been derived by Wiener. The Fourier theory of fractal measures
has been developed much more since then.

Figure. The characteristic function $\phi_Y(t)$ of a random variable Y with a
centered Cantor distribution supported on [-1/2, 1/2] has the explicit formula
$\phi_Y(t) = \prod_{n=1}^{\infty} \cos(t/3^n)$, which was already derived by Wiener in the early
20th century. The formula can also be used to compute moments of Y with the
moment formula $E[X^m] = (-i)^m \frac{d^m}{dt^m} \phi_X(t)|_{t=0}$.

Corollary 2.17.6. The probability density of the sum of independent
random variables $\sum_{j=1}^n X_j$ is $f_1 \star f_2 \star \cdots \star f_n$, if $X_j$ has the density $f_j$.



Proof. This follows immediately from proposition (2.17.5) and the
algebraic isomorphism between the algebra of characteristic functions with
pointwise multiplication and the algebra of distribution functions with
convolution product. □

Example. Let $Y_k$ be IID random variables and let $X_k = \lambda^k Y_k$ with $0 < \lambda < 1$.
The process $S_n = \sum_{k=1}^n X_k$ is called the random walk with variable step
size or the branching random walk with exponentially decreasing steps. Let
$\mu$ be the law of the random sum $X = \sum_{k=1}^{\infty} X_k$. If $\phi_Y(t)$ is the characteristic
function of Y, then the characteristic function of X is

$$ \phi_X(t) = \prod_{n=1}^{\infty} \phi_Y(t\lambda^n) . $$

For example, if the random $Y_n$ take values -1, 1 with probability 1/2, where
$\phi_Y(t) = \cos(t)$, then

$$ \phi_X(t) = \prod_{n=1}^{\infty} \cos(t\lambda^n) . $$

The measure $\mu$ is then called a Bernoulli convolution. For example, for
$\lambda = 1/3$, the measure is supported on the Cantor set as we have seen
above. For more information on this stochastic process and the properties
of the measure $\mu$ which in a subtle way depends on $\lambda$, see [42].
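
For a finite number N of steps, the product formula is an exact identity, which can be confirmed by enumerating all $2^N$ sign patterns. The sketch below does this for a few values of $\lambda$, including the Cantor case $\lambda = 1/3$; the truncation N = 10, the point t = 4 and the function names are assumptions made for illustration.

```python
import cmath
import itertools
import math

def cf_enumerated(lam, t, N):
    """E[exp(i t sum_{n=1}^N lam^n Y_n)] by enumerating all 2^N sign patterns."""
    total = 0j
    for signs in itertools.product((-1, 1), repeat=N):
        x = sum(s * lam**(n + 1) for n, s in enumerate(signs))
        total += cmath.exp(1j * t * x)
    return total / 2**N

def cf_product(lam, t, N):
    """Truncated product prod_{n=1}^N cos(t lam^n)."""
    return math.prod(math.cos(t * lam**n) for n in range(1, N + 1))

N, t = 10, 4.0
errors = {lam: abs(cf_enumerated(lam, t, N) - cf_product(lam, t, N))
          for lam in (1 / 3, 0.5, 0.6)}
```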



Exercise. The characteristic function of a vector valued random variable
$X = (X_1, \ldots, X_k)$ is the complex-valued function

$$ \phi_X(t) = E[e^{it \cdot X}] $$

on $\mathbb{R}^k$, where we wrote $t = (t_1, \ldots, t_k)$. Two such random variables X, Y
are independent, if the $\sigma$-algebras $X^{-1}(\mathcal{B})$ and $Y^{-1}(\mathcal{B})$ are independent,
where $\mathcal{B}$ is the Borel $\sigma$-algebra on $\mathbb{R}^k$.

a) Show that if X and Y are independent then $\phi_{X+Y} = \phi_X \cdot \phi_Y$.

b) Given a real nonsingular $k \times k$ matrix A called the covariance matrix
and a vector $m = (m_1, \ldots, m_k)$ called the mean of X. We say, a vector
valued random variable X has a Gaussian distribution with covariance A
and mean m, if

$$ \phi_X(t) = e^{i m \cdot t - \frac{1}{2} (t \cdot At)} . $$

Show that the sum X + Y of two Gaussian distributed random variables is
again Gaussian distributed.

c) Find the probability density of a Gaussian distributed random variable
X with covariance matrix A and mean m.



Exercise. The Laplace transform of a positive random variable $X \geq 0$ is
defined as $l_X(t) = E[e^{-tX}]$. The moment generating function is defined as
$M(t) = E[e^{tX}]$ provided that the expectation exists in a neighborhood of
0. The generating function of an integer-valued random variable is defined
as $\zeta(X) = E[u^X]$ for $u \in (0, 1)$. What does independence of two random
variables X, Y mean in terms of (i) the Laplace transform, (ii) the moment
generating function or (iii) the generating function?



Exercise. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space and let $U, V \in \mathcal{L}$ be
random variables (describing the energy density and the mass density of a
thermodynamical system). We have seen that the Helmholtz free energy

$$ E_{\tilde\mu}[U] - kT H[\tilde\mu] $$

(k is a physical constant, T is the temperature) takes its minimum for
the exponential family. Find the measure minimizing the free enthalpy or
Gibbs potential

$$ E_{\tilde\mu}[U] - kT H[\tilde\mu] - p E_{\tilde\mu}[V] , $$

where p is the pressure.



Exercise. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space and $X_i \in \mathcal{L}$ random variables.
Compute $E_{\mu_\lambda}[X_i]$ and the entropy of $\mu_\lambda$ in terms of the partition function
$Z(\lambda)$.

Exercise. a) Given the discrete measure space $(\Omega = \{\epsilon_0 + n\delta\}, \nu)$, with
$\epsilon_0 \in \mathbb{R}$ and $\delta > 0$ and where $\nu$ is the counting measure and let X(k) = k.
Find the distribution f maximizing the entropy H(f) among all measures
$\tilde\mu = f\nu$ fixing $E_{\tilde\mu}[X] = \epsilon$.

b) The physical interpretation is as follows: $\Omega$ is the discrete set of
energies of a harmonic oscillator, $\epsilon_0$ is the ground state energy, $\delta = \hbar\omega$ is the
incremental energy, where $\omega$ is the frequency of the oscillation and $\hbar$ is
Planck's constant. X(k) = k is the Hamiltonian and E[X] is the energy.
Put $\lambda = 1/kT$, where T is the temperature (in the answer of a), there
appears a parameter $\lambda$, the Lagrange multiplier of the variational problem).
Since one can also fix the temperature T instead of the energy $\epsilon$, the
distribution in a) maximizing the entropy is determined by $\omega$ and T. Compute the
spectrum $\epsilon(\omega, T)$ of the blackbody radiation defined by

$$ \epsilon(\omega, T) = (E[X] - \epsilon_0) \frac{\omega^2}{\pi^2 c^3} , $$

where c is the velocity of light. You have deduced then Planck's blackbody
radiation formula.



2.18 The law of the iterated logarithm 

We will give only a proof of the law of the iterated logarithm in the special
case when the random variables $X_n$ are independent and all have the
standard normal distribution. The proof of the theorem for general IID
random variables $X_n$ can be found for example in [109]. The central limit
theorem makes the general result plausible when knowing this special case.

Definition. A random variable $X \in \mathcal{L}$ is called symmetric if its law $\mu_X$
satisfies:

$$ \mu((-b, -a)) = \mu([a, b)) $$

for all a < b. A symmetric random variable $X \in \mathcal{L}^1$ has zero mean. We
again use the notation $S_n = \sum_{k=1}^n X_k$ in this section.



Lemma 2.18.1. Let $X_n$ be symmetric and independent. For every $\epsilon > 0$

$$ P[ \max_{1 \leq k \leq n} S_k > \epsilon ] \leq 2 P[S_n > \epsilon] . $$



Proof. This is a direct consequence of Levy's theorem (2.11.6) because we
can take m = 0 as the median of a symmetric distribution. □
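
For a simple symmetric random walk with steps $\pm 1$ the inequality of the lemma can be checked exactly by enumerating all paths. This is an illustration with assumed parameters (n = 10 steps and a few thresholds), not part of the proof; the names `paths`, `check` and `results` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate, product

n = 10
# All 2^n paths (S_1, ..., S_n) of the walk with independent symmetric steps +-1.
paths = [list(accumulate(steps)) for steps in product((-1, 1), repeat=n)]

def check(eps):
    """Return (P[max_k S_k > eps], 2 P[S_n > eps]) as exact fractions."""
    lhs = Fraction(sum(1 for s in paths if max(s) > eps), len(paths))
    rhs = 2 * Fraction(sum(1 for s in paths if s[-1] > eps), len(paths))
    return lhs, rhs

results = {eps: check(eps) for eps in (0.5, 1, 2, 3.5, 5)}
```

For this particular walk the reflection principle shows that the bound is attained for some thresholds, so the factor 2 cannot be improved in general.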

Definition. Define for n > 2 the constants $\Lambda_n = \sqrt{2n \log\log n}$. It grows only
slightly faster than $\sqrt{2n}$. For example, in order that the factor $\sqrt{\log\log n}$
is 3, we already have $n = \exp(\exp(9)) > 1.33 \cdot 10^{3519}$.

Theorem 2.18.2 (Law of iterated logarithm for N(0, 1)). Let $X_n$ be a
sequence of IID N(0, 1)-distributed random variables. Then

$$ \limsup_{n \to \infty} \frac{S_n}{\Lambda_n} = 1 , \qquad \liminf_{n \to \infty} \frac{S_n}{\Lambda_n} = -1 . $$



Proof. We follow [48]. Because the second statement follows obviously from
the first one by replacing $X_n$ by $-X_n$, we have only to prove

$$ \limsup_{n \to \infty} S_n/\Lambda_n = 1 . $$

(i) $P[S_n > (1 + \epsilon)\Lambda_n, \text{ infinitely often}] = 0$ for all $\epsilon > 0$.

Define $n_k = [(1 + \epsilon)^k] \in \mathbb{N}$, where [x] is the integer part of x and the events

$$ A_k = \{ S_n > (1 + \epsilon)\Lambda_n , \text{ for some } n \in (n_k, n_{k+1}] \} . $$

Clearly $\limsup_k A_k = \{ S_n > (1 + \epsilon)\Lambda_n, \text{ infinitely often} \}$. By the first
Borel-Cantelli lemma (2.2.2), it is enough to show that $\sum_k P[A_k] < \infty$. For each
large enough k, we get with the above lemma

$$ P[A_k] \leq P[ \max_{n_k < n \leq n_{k+1}} S_n > (1 + \epsilon)\Lambda_{n_k} ]
   \leq P[ \max_{1 \leq n \leq n_{k+1}} S_n > (1 + \epsilon)\Lambda_{n_k} ]
   \leq 2 P[ S_{n_{k+1}} > (1 + \epsilon)\Lambda_{n_k} ] . $$

The right-hand side can be estimated further using that $S_{n_{k+1}}/\sqrt{n_{k+1}}$
is N(0, 1)-distributed and that for a N(0, 1)-distributed random variable
$P[X > t] \leq \text{const} \cdot e^{-t^2/2}$:

$$ 2 P[ S_{n_{k+1}} > (1 + \epsilon)\Lambda_{n_k} ]
   = 2 P\left[ \frac{S_{n_{k+1}}}{\sqrt{n_{k+1}}} > (1 + \epsilon) \frac{\sqrt{2 n_k \log\log n_k}}{\sqrt{n_{k+1}}} \right] $$
$$ \leq C \exp\left( -\frac{1}{2} (1 + \epsilon)^2 \frac{2 n_k \log\log n_k}{n_{k+1}} \right)
   \leq C_1 \exp( -(1 + \epsilon) \log\log(n_k) )
   = C_1 \log(n_k)^{-(1+\epsilon)} \leq C_2 k^{-(1+\epsilon)} . $$

Having shown that $P[A_k] \leq \text{const} \cdot k^{-(1+\epsilon)}$ for large enough k proves the
claim $\sum_k P[A_k] < \infty$.

(ii) $P[S_n > (1 - \epsilon)\Lambda_n, \text{ infinitely often}] = 1$ for all $\epsilon > 0$.
It suffices to show that for all $\epsilon > 0$, there exists a subsequence $n_k$ with

$$ P[S_{n_k} > (1 - \epsilon)\Lambda_{n_k}, \text{ infinitely often}] = 1 . $$

Given $\epsilon > 0$. Choose N > 1 large enough and c < 1 near enough to 1 such
that

$$ c\sqrt{1 - 1/N} - 2/\sqrt{N} > 1 - \epsilon . \qquad (2.13) $$

Define $n_k = N^k$ and $\Delta n_k = n_k - n_{k-1}$. The sets

$$ A_k = \{ S_{n_k} - S_{n_{k-1}} > c\sqrt{2 \Delta n_k \log\log \Delta n_k} \} $$

are independent. In the following estimate, we use the fact that
$\int_t^{\infty} e^{-x^2/2} \, dx \geq C \cdot e^{-t^2/2}$ for some constant C.

$$ P[A_k] = P[ \{ S_{n_k} - S_{n_{k-1}} > c\sqrt{2 \Delta n_k \log\log \Delta n_k} \} ]
   = P\left[ \left\{ \frac{S_{n_k} - S_{n_{k-1}}}{\sqrt{\Delta n_k}} > c \frac{\sqrt{2 \Delta n_k \log\log \Delta n_k}}{\sqrt{\Delta n_k}} \right\} \right] $$
$$ \geq C \cdot \exp( -c^2 \log\log \Delta n_k ) \geq C \cdot \exp( -c^2 \log(k \log N) )
   = C_1 \cdot \exp( -c^2 \log k ) = C_1 k^{-c^2} $$

so that $\sum_k P[A_k] = \infty$. We have therefore by Borel-Cantelli a set A of full
measure so that for $\omega \in A$

$$ S_{n_k} - S_{n_{k-1}} > c\sqrt{2 \Delta n_k \log\log \Delta n_k} $$

for infinitely many k. From (i), we know that

$$ S_{n_k} > -2\sqrt{2 n_k \log\log n_k} $$

for sufficiently large k. Both inequalities hold therefore for infinitely many
values of k. For such k,

$$ S_{n_k}(\omega) > S_{n_{k-1}}(\omega) + c\sqrt{2 \Delta n_k \log\log \Delta n_k} $$
$$ \geq -2\sqrt{2 n_{k-1} \log\log n_{k-1}} + c\sqrt{2 \Delta n_k \log\log \Delta n_k} $$
$$ \geq ( -2/\sqrt{N} + c\sqrt{1 - 1/N} ) \sqrt{2 n_k \log\log n_k} $$
$$ \geq (1 - \epsilon) \sqrt{2 n_k \log\log n_k} , $$

where we have used assumption (2.13) in the last inequality. □
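
To get a feeling for the theorem, one can simulate a Gaussian random walk and follow the ratio $S_n/\Lambda_n$. The sketch below is only an illustration: the seed, the number of steps and the starting index 1000 are arbitrary assumptions, and because $\log\log n$ grows extremely slowly, a simulation of this size cannot exhibit the limit values $\pm 1$; it only shows that the ratio stays of order 1.

```python
import math
import random

rng = random.Random(2024)   # fixed seed so that the run is reproducible
n_max = 200_000

S = 0.0
max_ratio = -math.inf
min_ratio = math.inf
for n in range(1, n_max + 1):
    S += rng.gauss(0.0, 1.0)            # one N(0,1) step of the walk
    if n >= 1000:                       # log log n is comfortably positive here
        ratio = S / math.sqrt(2 * n * math.log(math.log(n)))
        max_ratio = max(max_ratio, ratio)
        min_ratio = min(min_ratio, ratio)
```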

We know that N(0, 1) is the unique fixed point of the map T by the central
limit theorem. If the law of the iterated logarithm is true for T(X), then
it is true for X. This shows that it would be enough to prove the theorem
in the case when X has distribution in an arbitrary small neighborhood of
N(0, 1). We would need however sharper estimates.

We present a second proof of the central limit theorem in the IID case, to
illustrate the use of characteristic functions.



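Theorem 2.18.3 (Central limit theorem for IID random variables). Given
$X_n \in \mathcal{L}^2$ which are IID with mean 0 and finite variance $\sigma^2$. Then
$S_n/(\sigma\sqrt{n}) \to N(0, 1)$ in distribution.



Proof. The characteristic function of N(0, 1) is $\phi(t) = e^{-t^2/2}$. We have to
show that for all $t \in \mathbb{R}$

$$ E[ e^{it S_n/(\sigma\sqrt{n})} ] \to e^{-t^2/2} . $$

Denote by $\phi_{X_n}$ the characteristic function of $X_n$. Since by assumption
$E[X_n] = 0$ and $E[X_n^2] = \sigma^2$, we have

$$ \phi_{X_n}(t) = 1 - \frac{\sigma^2}{2} t^2 + o(t^2) . $$

Therefore

$$ E[ e^{it S_n/(\sigma\sqrt{n})} ] = \phi_{X_n}\left( \frac{t}{\sigma\sqrt{n}} \right)^n
   = \left( 1 - \frac{1}{2} \frac{t^2}{n} + o\left( \frac{1}{n} \right) \right)^n = e^{-t^2/2} + o(1) . $$

□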



This method can be adapted to other situations as the following example 
shows. 



Proposition 2.18.4. Given a sequence of independent events $A_n \subset \Omega$ with
$P[A_n] = 1/n$. Define the random variables $X_n = 1_{A_n}$ and $S_n = \sum_{k=1}^n X_k$.
Then

$$ T_n = \frac{S_n - \log(n)}{\sqrt{\log(n)}} $$

converges to N(0, 1) in distribution.



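Proof.

$$ E[S_n] = \sum_{k=1}^n \frac{1}{k} = \log(n) + \gamma + o(1) , $$

where $\gamma = \lim_{n \to \infty} \sum_{k=1}^n \frac{1}{k} - \log(n)$ is the Euler constant.

$$ \mathrm{Var}[S_n] = \sum_{k=1}^n \frac{1}{k} \left( 1 - \frac{1}{k} \right) = \log(n) + \gamma - \frac{\pi^2}{6} + o(1) . $$

The random variables $T_n$ satisfy $E[T_n] \to 0$ and $\mathrm{Var}[T_n] \to 1$. Compute
$\phi_{X_n} = 1 - \frac{1}{n} + \frac{e^{it}}{n}$ so that
$\phi_{S_n}(t) = \prod_{k=1}^n ( 1 - \frac{1}{k} + \frac{e^{it}}{k} )$ and
$\phi_{T_n}(t) = \phi_{S_n}(s(t)) e^{-is \log(n)}$, where $s = t/\sqrt{\log(n)}$.
For $n \to \infty$, we compute

$$ \log \phi_{T_n}(t) = -it\sqrt{\log(n)} + \sum_{k=1}^n \log\left( 1 + \frac{1}{k} (e^{is} - 1) \right) $$
$$ = -it\sqrt{\log(n)} + \sum_{k=1}^n \log\left( 1 + \frac{1}{k} \left( is - \frac{1}{2} s^2 + o(s^2) \right) \right) $$
$$ = -it\sqrt{\log(n)} + \sum_{k=1}^n \frac{1}{k} \left( is - \frac{1}{2} s^2 + o(s^2) \right) + O\left( \sum_{k=1}^n \frac{s^2}{k^2} \right) $$
$$ = -it\sqrt{\log(n)} + \left( is - \frac{1}{2} s^2 + o(s^2) \right) ( \log(n) + O(1) ) + t^2 o(1) $$
$$ = -\frac{1}{2} t^2 + o(1) \to -\frac{1}{2} t^2 . $$

We see that $T_n$ converges in law to the standard normal distribution. □

Chapter 3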

Discrete Stochastic Processes 



3.1 Conditional Expectation 

Definition. Given a probability space $(\Omega, \mathcal{A}, P)$. A second measure P' on
$(\Omega, \mathcal{A})$ is called absolutely continuous with respect to P, if P[A] = 0 implies
P'[A] = 0 for all $A \in \mathcal{A}$. One writes $P' \ll P$.

Example. If P[a, b] = b - a is the uniform distribution on $\Omega = [0, 1]$ and $\mathcal{A}$
is the Borel $\sigma$-algebra, and $Y \in \mathcal{L}^1$ satisfies $Y(x) \geq 0$ for all $x \in \Omega$, then
$P'[a, b] = \int_a^b Y(x) \, dx$ is absolutely continuous with respect to P.

Example. Assume P is again the Lebesgue measure on [0, 1] as in the last
example. If $Y(x) = 1_B(x)$, then $P'[A] = P[A \cap B]$ for all $A \in \mathcal{A}$. If P[B] < 1,
then P is not absolutely continuous with respect to P'. We have $P'[B^c] = 0$
but $P[B^c] = 1 - P[B] > 0$.



Example. If P'[A] = 1 for $1/2 \in A$ and P'[A] = 0 for $1/2 \notin A$, then P' is not
absolutely continuous with respect to P. For B = {1/2}, we have P[B] = 0 but
$P'[B] = 1 \neq 0$.

The next theorem is a reformulation of a classical theorem of
Radon-Nikodym of 1913 and 1930.



Theorem 3.1.1 (Radon-Nikodym equivalent). Given a measure P' which
is absolutely continuous with respect to P, then there exists a unique
$Y \in L^1(P)$ with P' = YP. The function Y is called the Radon-Nikodym
derivative of P' with respect to P. It is unique in $L^1$.



Proof. We can assume without loss of generality that P' is a positive
measure (do else the Hahn decomposition $P' = P^+ - P^-$, where $P^+$ and $P^-$
are positive measures).

(i) Construction: We recall the notation $E[Y; A] = E[1_A Y] = \int_A Y \, dP$.
The set $\Gamma = \{ Y \geq 0 \mid E[Y; A] \leq P'[A], \forall A \in \mathcal{A} \}$ is closed under formation
of suprema

$$ E[Y_1 \vee Y_2; A] = E[Y_1; A \cap \{Y_1 > Y_2\}] + E[Y_2; A \cap \{Y_2 \geq Y_1\}] $$
$$ \leq P'[A \cap \{Y_1 > Y_2\}] + P'[A \cap \{Y_2 \geq Y_1\}] = P'[A] $$

and contains a function Y different from 0 since else, P' would be singular
with respect to P according to the definition given in section (2.12) of
absolute continuity. We claim that the supremum Y of all functions in $\Gamma$ satisfies
YP = P': an application of Beppo-Levi's theorem (2.4.1) shows that the
supremum of $\Gamma$ is in $\Gamma$. The measure P'' = P' - YP is the zero measure
since we could do the same argument with a new set $\Gamma$ for the absolutely
continuous part of P''.

(ii) Uniqueness: assume there exist two derivatives Y, Y'. One has then
$E[Y - Y'; \{Y \geq Y'\}] = 0$ and so $Y \leq Y'$ almost everywhere. A similar
argument gives $Y' \leq Y$ almost everywhere, so that Y = Y' almost
everywhere. In other words, Y = Y' in $L^1$. □



Theorem 3.1.2 (Existence of conditional expectation, Kolmogorov 1933).
Given $X \in \mathcal{L}^1(\mathcal{A})$ and a sub $\sigma$-algebra $\mathcal{B} \subset \mathcal{A}$. There exists a random
variable $Y \in \mathcal{L}^1(\mathcal{B})$ with $\int_A Y \, dP = \int_A X \, dP$ for all $A \in \mathcal{B}$.



Proof. Define the measures $\tilde{P}[A] = P[A]$ and $P'[A] = \int_A X \, dP = E[X; A]$
on the probability space $(\Omega, \mathcal{B})$. Given a set $B \in \mathcal{B}$ with $\tilde{P}[B] = 0$, then
P'[B] = 0 so that P' is absolutely continuous with respect to $\tilde{P}$.
Radon-Nikodym's theorem (3.1.1) provides us with a random variable $Y \in \mathcal{L}^1(\mathcal{B})$
with $P'[A] = \int_A X \, dP = \int_A Y \, dP$. □



Definition. The random variable Y in this theorem is denoted with $E[X|\mathcal{B}]$
and called the conditional expectation of X with respect to $\mathcal{B}$. The random
variable $Y \in \mathcal{L}^1(\mathcal{B})$ is unique in $L^1(\mathcal{B})$. If Z is a random variable, then
E[X|Z] is defined as $E[X|\sigma(Z)]$. If $\{Z_j\}$ is a family of random variables,
then $E[X|\{Z_j\}]$ is defined as $E[X|\sigma(\{Z_j\})]$.



Example. If $\mathcal{B}$ is the trivial $\sigma$-algebra $\mathcal{B} = \{\emptyset, \Omega\}$, then $E[X|\mathcal{B}] = E[X]$.
Example. If $\mathcal{B} = \mathcal{A}$, then $E[X|\mathcal{B}] = X$.
Example. If $\mathcal{B} = \{\emptyset, Y, Y^c, \Omega\}$ then

$$ E[X|\mathcal{B}](\omega) = \begin{cases} \frac{1}{P[Y]} \int_Y X \, dP & \text{for } \omega \in Y , \\ \frac{1}{P[Y^c]} \int_{Y^c} X \, dP & \text{for } \omega \in Y^c . \end{cases} $$

Example. Let $(\Omega, \mathcal{A}, P) = ([0, 1] \times [0, 1], \mathcal{A}, dx\,dy)$, where $\mathcal{A}$ is the Borel
$\sigma$-algebra defined by the Euclidean distance metric on the square $\Omega$. Let $\mathcal{B}$
be the $\sigma$-algebra of sets $A \times [0, 1]$, where A is in the Borel $\sigma$-algebra of the
interval [0, 1]. If X(x, y) is a random variable on $\Omega$, then $Y = E[X|\mathcal{B}]$ is the
random variable

$$ Y(x, y) = \int_0^1 X(x, y) \, dy . $$

This conditional integral only depends on x.

Remark. This notion of conditional expectation will be important later.
Here is a possible interpretation of conditional expectation: for an experiment,
the possible outcomes are modeled by a probability space $(\Omega, \mathcal{A})$
which is our "laboratory". Assume that the only information about the
experiment are the events in a subalgebra $\mathcal{B}$ of $\mathcal{A}$. It models the
"knowledge" obtained from some measurements we can do in the laboratory and
$\mathcal{B}$ is generated by a set of random variables $\{Z_i\}$ obtained from some
measuring devices. With respect to these measurements, our best knowledge
of the random variable $X$ is the conditional expectation $E[X|\mathcal{B}]$. It is
a random variable which is a function of the measurements $Z_i$. For a specific
"experiment" $\omega$, the conditional expectation $E[X|\mathcal{B}](\omega)$ is the expected
value of $X(\omega)$, conditioned to the $\sigma$-algebra $\mathcal{B}$ which contains the events
singled out by data from the $Z_i$.



Proposition 3.1.3. The conditional expectation $X \mapsto E[X|\mathcal{B}]$ is the
projection from $\mathcal{L}^2(\mathcal{A})$ onto $\mathcal{L}^2(\mathcal{B})$.



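Proof. The space $\mathcal{L}^2(\mathcal{B})$ of square integrable $\mathcal{B}$-measurable functions is a
linear subspace of $\mathcal{L}^2(\mathcal{A})$. When identifying functions which agree almost
everywhere, then $L^2(\mathcal{B})$ is a Hilbert space which is a linear subspace of the
Hilbert space $L^2(\mathcal{A})$. For any $X \in \mathcal{L}^2(\mathcal{A})$, there exists a unique projection
$p(X) \in \mathcal{L}^2(\mathcal{B})$. The orthogonal complement $\mathcal{L}^2(\mathcal{B})^\perp$ is defined as

$$ \mathcal{L}^2(\mathcal{B})^\perp = \{ Z \in \mathcal{L}^2(\mathcal{A}) \mid (Z, Y) := E[Z \cdot Y] = 0 \text{ for all } Y \in \mathcal{L}^2(\mathcal{B}) \} . $$

By the definition of the conditional expectation, we have for $A \in \mathcal{B}$

$$ (X - E[X|\mathcal{B}], 1_A) = E[X - E[X|\mathcal{B}]; A] = 0 . $$

Since linear combinations of indicator functions $1_A$ with $A \in \mathcal{B}$ are dense
in $\mathcal{L}^2(\mathcal{B})$, it follows that $X - E[X|\mathcal{B}] \in \mathcal{L}^2(\mathcal{B})^\perp$. Because the map
$q(X) = E[X|\mathcal{B}]$ satisfies $q^2 = q$, is linear and has the property that
$(1 - q)(X)$ is perpendicular to $\mathcal{L}^2(\mathcal{B})$, the map $q$ is a projection which
must agree with $p$. □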

Example. Let $\Omega = \{1, 2, 3, 4\}$ and $\mathcal{A}$ the $\sigma$-algebra of all subsets of $\Omega$. Let
$\mathcal{B} = \{\emptyset, \{1, 2\}, \{3, 4\}, \Omega\}$. What is the conditional expectation $Y = E[X|\mathcal{B}]$
of the random variable $X(k) = k^2$? The Hilbert space $\mathcal{L}^2(\mathcal{A})$ is the
four-dimensional space $\mathbb{R}^4$ because a random variable $X$ is now just a vector
$X = (X(1), X(2), X(3), X(4))$. The Hilbert space $\mathcal{L}^2(\mathcal{B})$ is the set of all
vectors $v = (v_1, v_2, v_3, v_4)$ for which $v_1 = v_2$ and $v_3 = v_4$ because functions
which would not be constant in $(v_1, v_2)$ would generate a finer algebra. It is
the two-dimensional subspace of all vectors $\{ v = (a, a, b, b) \mid a, b \in \mathbb{R} \}$. The
conditional expectation projects onto that plane. The first two components
$(X(1), X(2))$ project to $\left( \frac{X(1)+X(2)}{2}, \frac{X(1)+X(2)}{2} \right)$, the second two components
project to $\left( \frac{X(3)+X(4)}{2}, \frac{X(3)+X(4)}{2} \right)$. Therefore,

$$ E[X|\mathcal{B}] = \left( \frac{X(1)+X(2)}{2}, \frac{X(1)+X(2)}{2}, \frac{X(3)+X(4)}{2}, \frac{X(3)+X(4)}{2} \right) . $$
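The projection in this example can be checked numerically. The following is a minimal sketch, assuming the uniform measure $P[\{k\}] = 1/4$ (which is what makes the orthogonal projection in $\mathbb{R}^4$ a plain average); the helper name `conditional_expectation` is hypothetical and not part of the text.

```python
from fractions import Fraction

def conditional_expectation(X, P, partition):
    """Return E[X|B] as a dict, where B is generated by the blocks of `partition`.

    X: dict outcome -> value, P: dict outcome -> probability,
    partition: list of disjoint blocks (the atoms of B) covering all outcomes.
    """
    Y = {}
    for block in partition:
        mass = sum(P[w] for w in block)                 # P[block]
        mean = sum(X[w] * P[w] for w in block) / mass   # E[X; block] / P[block]
        for w in block:
            Y[w] = mean                                 # E[X|B] is constant on each atom
    return Y

omega = [1, 2, 3, 4]
P = {w: Fraction(1, 4) for w in omega}     # assumption: uniform measure
X = {w: Fraction(w * w) for w in omega}    # X(k) = k^2
Y = conditional_expectation(X, P, [[1, 2], [3, 4]])
print(Y)

# X - Y is orthogonal to the indicator functions of the atoms of B,
# hence to every B-measurable vector (a, a, b, b).
residual_12 = sum((X[w] - Y[w]) * P[w] for w in [1, 2])
residual_34 = sum((X[w] - Y[w]) * P[w] for w in [3, 4])
print(residual_12, residual_34)
```

With $X = (1, 4, 9, 16)$ the averages on the two atoms are $5/2$ and $25/2$, so the sketch returns the vector $(5/2, 5/2, 25/2, 25/2)$.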



Remark. Proposition 3.1.3 means that $Y$ is the least-squares best
$\mathcal{B}$-measurable square integrable predictor. This makes conditional expectation
important for controlling processes. If $\mathcal{B}$ is the $\sigma$-algebra describing the
knowledge about a process (like for example the data which a pilot knows
about a plane) and $X$ is the random variable we want to know (which could be
the actual data of the flying plane), then $E[X|\mathcal{B}]$ is the best guess
about this random variable we can make with our knowledge.



Exercise. Given two independent random variables $X, Y \in \mathcal{L}^2$ such that $X$
has the Poisson distribution $P_\lambda$ and $Y$ has the Poisson distribution $P_\mu$. The
random variable $Z = X + Y$ has Poisson distribution $P_{\lambda+\mu}$ as can be seen
with the help of characteristic functions. Let $\mathcal{B}$ be the $\sigma$-algebra generated
by $Z$. Show that

$$ E[X|\mathcal{B}] = \frac{\lambda}{\lambda+\mu} Z . $$

Hint: It is enough to show

$$ E[X; \{Z = k\}] = \frac{\lambda}{\lambda+\mu} \, k \, P[Z = k] . $$
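The identity in the hint, read as $E[X; \{Z = k\}] = \frac{\lambda}{\lambda+\mu} k P[Z = k]$, can be checked numerically before proving it. This is only a sanity check under assumed rates (the values of `lam` and `mu` are arbitrary illustrations), not a solution of the exercise.

```python
import math

def poisson_pmf(rate, k):
    """P[N = k] for a Poisson random variable N with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def restricted_expectation(lam, mu, k):
    """E[X; {Z = k}] for independent X ~ Poisson(lam), Y ~ Poisson(mu), Z = X + Y."""
    # On {Z = k} we have X = j and Y = k - j for some 0 <= j <= k.
    return sum(j * poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))

lam, mu = 1.5, 2.5   # assumption: arbitrary illustrative rates
for k in range(6):
    lhs = restricted_expectation(lam, mu, k)
    rhs = lam / (lam + mu) * k * poisson_pmf(lam + mu, k)
    print(k, lhs, rhs)   # the two columns agree up to rounding
```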



Even if random variables are only in $\mathcal{L}^1$, the next list of properties of
conditional expectation can be remembered better with proposition 3.1.3
in mind, which identifies conditional expectation as a projection if they are
in $\mathcal{L}^2$.

Theorem 3.1.4 (Properties of conditional expectation). For given random
variables $X, X_n, Y \in \mathcal{L}^1$, the following properties hold:

(1) Linearity: The map $X \mapsto E[X|\mathcal{B}]$ is linear.

(2) Positivity: $X \ge 0 \Rightarrow E[X|\mathcal{B}] \ge 0$.

(3) Tower property: $\mathcal{C} \subset \mathcal{B} \subset \mathcal{A} \Rightarrow E[E[X|\mathcal{B}]|\mathcal{C}] = E[X|\mathcal{C}]$.

(4) Conditional Fatou: $|X_n| \le X \Rightarrow E[\liminf_{n \to \infty} X_n|\mathcal{B}] \le \liminf_{n \to \infty} E[X_n|\mathcal{B}]$.

(5) Conditional dominated convergence: $|X_n| \le Y$, $X_n \to X$ a.e.
$\Rightarrow E[X_n|\mathcal{B}] \to E[X|\mathcal{B}]$ a.e.

(6) Conditional Jensen: if $h$ is convex, then $E[h(X)|\mathcal{B}] \ge h(E[X|\mathcal{B}])$.
Especially $\| E[X|\mathcal{B}] \|_p \le \| X \|_p$.

(7) Extracting knowledge: For $Z \in \mathcal{L}^\infty(\mathcal{B})$, one has $E[ZX|\mathcal{B}] = Z E[X|\mathcal{B}]$.

(8) Independence: if $X$ is independent of $\mathcal{C}$, then $E[X|\mathcal{C}] = E[X]$.



Proof. (1) The conditional expectation is a projection by Proposition (3.1.3)
and so linear.

(2) For positivity, note that if $Y = E[X|\mathcal{B}]$ would be negative on a set of
positive measure, then $A = Y^{-1}((-\infty, -1/n]) \in \mathcal{B}$ would have positive
probability for some $n$. This would lead to the contradiction
$0 \le E[1_A X] = E[1_A Y] \le -n^{-1} P[A] < 0$.

(3) Use that $P'' \ll P' \ll P$ implies $P'' = Y'P' = Y'YP$ and $P'' \ll P$ gives
$P'' = ZP$ so that $Z = Y'Y$ almost everywhere.

This is especially useful when applied to the algebra $\mathcal{C}_Y = \{\emptyset, Y, Y^c, \Omega\}$,
because two $\mathcal{B}$-measurable random variables $U, V$ satisfy $U \le V$ almost
everywhere if and only if $E[U|\mathcal{C}_Y] \le E[V|\mathcal{C}_Y]$ for all $Y \in \mathcal{B}$.

(4) - (5) The conditional versions of the Fatou lemma or the dominated
convergence theorem (2.4.3) are true, if they are true conditioned with $\mathcal{C}_Y$ for
each $Y \in \mathcal{B}$. The tower property reduces these statements to versions with
$\mathcal{B} = \mathcal{C}_Y$ which are then on each of the sets $Y, Y^c$ the usual theorems.

(6) Choose a sequence $(a_n, b_n) \in \mathbb{R}^2$ such that $h(x) = \sup_n (a_n x + b_n)$ for all
$x \in \mathbb{R}$. We get from $h(X) \ge a_n X + b_n$ that almost surely
$E[h(X)|\mathcal{B}] \ge a_n E[X|\mathcal{B}] + b_n$. These inequalities hold therefore
simultaneously for all $n$ and we obtain almost surely

$$ E[h(X)|\mathcal{B}] \ge \sup_n (a_n E[X|\mathcal{B}] + b_n) = h(E[X|\mathcal{B}]) . $$

The corollary is obtained with $h(x) = |x|^p$.

(7) It is enough to condition it to each algebra $\mathcal{C}_Y$ for $Y \in \mathcal{B}$. The tower
property reduces these statements to linearity.

(8) By linearity, we can assume $X \ge 0$. For $B \in \mathcal{B}$ and $C \in \mathcal{C}$, the random
variables $X 1_B$ and $1_C$ are independent so that

$$ E[X 1_{B \cap C}] = E[X 1_B 1_C] = E[X 1_B] P[C] . $$

The random variable $Y = E[X|\mathcal{B}]$ is $\mathcal{B}$-measurable and because $Y 1_B$ is
independent of $\mathcal{C}$ we get

$$ E[(Y 1_B) 1_C] = E[Y 1_B] P[C] $$

so that $E[1_{B \cap C} X] = E[1_{B \cap C} Y]$. The measures on $\sigma(\mathcal{B}, \mathcal{C})$

$$ \mu : A \mapsto E[1_A X], \qquad \nu : A \mapsto E[1_A Y] $$

agree therefore on the $\pi$-system of the form $B \cap C$ with $B \in \mathcal{B}$ and $C \in \mathcal{C}$
and consequently everywhere on $\sigma(\mathcal{B}, \mathcal{C})$. For the trivial $\sigma$-algebra
$\mathcal{B} = \{\emptyset, \Omega\}$, where $Y = E[X]$ is constant, this is the claim
$E[X|\mathcal{C}] = E[X]$. □

Remark. From the conditional Jensen property in theorem (3.1.4), it follows
that the operation of conditional expectation is a positive and continuous
operation on $\mathcal{L}^p$ for any $p \ge 1$.

Remark. The properties of Conditional Fatou, Lebesgue and Jensen are
statements about functions in $\mathcal{L}^1(\mathcal{B})$ and not about numbers as the usual
theorems of Fatou, Lebesgue or Jensen.

Remark. Is there for almost all $\omega \in \Omega$ a probability measure $P_\omega$ such that

$$ E[X|\mathcal{B}](\omega) = \int_\Omega X \, dP_\omega \ ? $$

If such a map from $\Omega$ to $M_1(\Omega)$ exists and if it is $\mathcal{B}$-measurable, it is called
a regular conditional probability given $\mathcal{B}$. In general such a map $\omega \mapsto P_\omega$
does not exist. However, it is known that for a probability space $(\Omega, \mathcal{A}, P)$
for which $\Omega$ is a complete separable metric space with Borel $\sigma$-algebra $\mathcal{A}$,
there exists a regular conditional probability for any sub $\sigma$-algebra $\mathcal{B}$ of $\mathcal{A}$.



Exercise. This exercise deals with conditional expectation.

a) What is $E[Y|Y]$?

b) Show that if $E[X|\mathcal{A}] = 0$ and $E[X|\mathcal{B}] = 0$, then $E[X|\sigma(\mathcal{A}, \mathcal{B})] = 0$.

c) Given $X, Y \in \mathcal{L}^1$ satisfying $E[X|Y] = Y$ and $E[Y|X] = X$. Verify that
$X = Y$ almost everywhere.

We add a notation which is commonly used.

Definition. The conditional probability space $(\Omega, \mathcal{A}, P[\cdot|\mathcal{B}])$ is defined by

$$ P[B|\mathcal{B}] = E[1_B|\mathcal{B}] . $$

For $X \in \mathcal{L}^p$, one has the conditional moment $E[X^p|\mathcal{B}]$ if $\mathcal{B}$ is a $\sigma$-subalgebra
of $\mathcal{A}$. They are $\mathcal{B}$-measurable random variables and generalize the usual
moments. Of special interest is the conditional variance:

Definition. For $X \in \mathcal{L}^2$, the conditional variance $\mathrm{Var}[X|\mathcal{B}]$ is the random
variable $E[X^2|\mathcal{B}] - E[X|\mathcal{B}]^2$. Especially, if $\mathcal{B}$ is generated by a random
variable $Y$, one writes $\mathrm{Var}[X|Y] = E[X^2|Y] - E[X|Y]^2$.

Remark. Because conditional expectation is a projection, many properties
known for the usual variance hold for the more general notion of conditional
variance. For example, if $X, Z$ are random variables in $\mathcal{L}^2$ which are
conditionally independent given $Y$ (for instance if $X, Y, Z$ are independent),
then $\mathrm{Var}[X + Z|Y] = \mathrm{Var}[X|Y] + \mathrm{Var}[Z|Y]$. One also has the identity
$\mathrm{Var}[X|Y] = E[(X - E[X|Y])^2|Y]$.



Lemma 3.1.5. (Law of total variance) For $X \in \mathcal{L}^2$ and an arbitrary random
variable $Y$, one has

$$ \mathrm{Var}[X] = E[\mathrm{Var}[X|Y]] + \mathrm{Var}[E[X|Y]] . $$



Proof. By the definition of the conditional variance as well as the properties
of conditional expectation:

$$ \mathrm{Var}[X] = E[X^2] - E[X]^2 $$
$$ = E[E[X^2|Y]] - E[E[X|Y]]^2 $$
$$ = E[\mathrm{Var}[X|Y]] + E[E[X|Y]^2] - E[E[X|Y]]^2 $$
$$ = E[\mathrm{Var}[X|Y]] + \mathrm{Var}[E[X|Y]] . $$

□
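The law of total variance can be verified exactly on a finite probability space. The sketch below uses a hypothetical joint law of $(X, Y)$ chosen only for illustration; the names `joint`, `mean` and `variance` are not from the text.

```python
from fractions import Fraction
from collections import defaultdict

# Assumption: a small hypothetical joint law of (X, Y), given as (x, y) -> probability.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (2, 1): Fraction(1, 4), (5, 1): Fraction(1, 4),
}

def mean(values_probs):
    return sum(v * p for v, p in values_probs)

def variance(values_probs):
    m = mean(values_probs)
    return sum((v - m) ** 2 * p for v, p in values_probs)

# Group the outcomes by the value of Y: these are the atoms of sigma(Y).
by_y = defaultdict(list)
for (x, y), p in joint.items():
    by_y[y].append((x, p))

p_y = {y: sum(p for _, p in xs) for y, xs in by_y.items()}
cond_mean = {y: mean([(x, p / p_y[y]) for x, p in xs]) for y, xs in by_y.items()}
cond_var = {y: variance([(x, p / p_y[y]) for x, p in xs]) for y, xs in by_y.items()}

total_var = variance([(x, p) for (x, _), p in joint.items()])       # Var[X]
expected_cond_var = sum(cond_var[y] * p_y[y] for y in p_y)          # E[Var[X|Y]]
var_cond_mean = variance([(cond_mean[y], p_y[y]) for y in p_y])     # Var[E[X|Y]]
print(total_var, expected_cond_var + var_cond_mean)
```

Because the arithmetic is done with fractions, the two printed numbers agree exactly and not just up to rounding.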

Here is an application which illustrates how one can use the conditional
variance: the Cantor distribution is the singular continuous
distribution whose law $\mu$ has its support on the standard Cantor set.



Corollary 3.1.6. (Variance of the Cantor distribution) The standard Cantor 
distribution for the Cantor set on [0, 1] has the expectation 1/2 and the 
variance 1/8. 



Proof. Let $X$ be a random variable with the Cantor distribution. By symmetry,
$E[X] = \int_0^1 x \, d\mu(x) = 1/2$. Define the $\sigma$-algebra

$$ \{ \emptyset, [0, 1/3), [1/3, 1], [0, 1] \} $$

on $\Omega = [0, 1]$. It is generated by the random variable $Y = 1_{[0,1/3)}$. Define
$Z = E[X|Y]$. It is a random variable which is constant $1/6$ on $[0, 1/3)$
and equal to $5/6$ on $[1/3, 1]$. It has the expectation
$E[Z] = (1/6) P[Y = 1] + (5/6) P[Y = 0] = 1/12 + 5/12 = 1/2$ and the variance

$$ \mathrm{Var}[Z] = E[Z^2] - E[Z]^2 = \frac{1}{36} P[Y = 1] + \frac{25}{36} P[Y = 0] - \frac{1}{4} = \frac{1}{9} . $$

Define the random variable $W = \mathrm{Var}[X|Y] = E[X^2|Y] - E[X|Y]^2 = E[X^2|Y] - Z^2$.
It is equal to $2 \int_0^{1/3} (x - 1/6)^2 \, d\mu(x)$ on $[0, 1/3)$ and equal to
$2 \int_{2/3}^{1} (x - 5/6)^2 \, d\mu(x)$ on $[1/3, 1]$, where the factor $2$ normalizes by
$P[Y = 1] = P[Y = 0] = 1/2$. By the self-similarity of the Cantor set, we
see that $W = \mathrm{Var}[X|Y]$ is actually constant and equal to $\mathrm{Var}[X]/9$. The
identity $E[\mathrm{Var}[X|Y]] = \mathrm{Var}[X]/9$ implies

$$ \mathrm{Var}[X] = E[\mathrm{Var}[X|Y]] + \mathrm{Var}[E[X|Y]] = E[W] + \mathrm{Var}[Z] = \frac{\mathrm{Var}[X]}{9} + \frac{1}{9} . $$

Solving for $\mathrm{Var}[X]$ gives $\mathrm{Var}[X] = 1/8$. □
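The value $1/8$ can be cross-checked with a different, elementary computation. The sketch assumes the standard representation of a Cantor distributed random variable as $X = \sum_{k \ge 1} 2 e_k / 3^k$ with independent fair digits $e_k \in \{0, 1\}$ and enumerates the truncation to $n$ digits exactly; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def truncated_cantor_moments(n):
    """Exact mean and variance of X_n = sum_{k=1}^n 2*e_k/3^k with fair 0/1 digits e_k."""
    weight = Fraction(1, 2 ** n)                      # every digit string is equally likely
    points = [sum(Fraction(2 * e, 3 ** k) for k, e in enumerate(digits, start=1))
              for digits in product((0, 1), repeat=n)]
    mean = sum(points) * weight
    var = sum((x - mean) ** 2 for x in points) * weight
    return mean, var

for n in (1, 2, 4, 8):
    mean, var = truncated_cantor_moments(n)
    print(n, float(mean), float(var))   # the variance approaches 1/8 = 0.125
```

The truncated variance is $\sum_{k=1}^n 9^{-k} = (1 - 9^{-n})/8$, which converges to $1/8$ in agreement with the corollary.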



Exercise. Given a probability space $(\Omega, \mathcal{A}, P)$ and a $\sigma$-algebra $\mathcal{B} \subset \mathcal{A}$.

a) Show that the map $P : X \in \mathcal{L}^1 \mapsto E[X|\mathcal{B}]$ is a Markov operator from
$\mathcal{L}^1(\mathcal{A}, P)$ to $\mathcal{L}^1(\mathcal{B}, Q)$, where $Q$ is the conditional probability measure on
$(\Omega, \mathcal{B})$ defined by $Q[A] = P[A]$ for $A \in \mathcal{B}$.

b) The map $T$ can also be viewed as a map on the new probability space
$(\Omega, \mathcal{B}, Q)$, where $Q$ is the conditional probability. Denote this new map by
$S$. Show that $S$ is again measure preserving and invertible.



Exercise. a) Given a measure preserving invertible map $T : \Omega \to \Omega$ we call
$(\Omega, T, \mathcal{A}, P)$ a dynamical system. A complex number $\lambda$ is called an
eigenvalue of $T$, if there exists $X \in \mathcal{L}^2$ such that $X(T) = \lambda X$. The map $T$ is said
to have pure point spectrum, if there exists a countable set of eigenvalues
$\lambda_i$ such that their eigenfunctions $X_i$ span $\mathcal{L}^2$. Show that if $T$ has pure point
spectrum, then also $S$ has pure point spectrum.

b) A measure preserving dynamical system $(\Delta, S, \mathcal{B}, \nu)$ is called a factor of a
measure preserving dynamical system $(\Omega, T, \mathcal{A}, \mu)$ if there exists a measure
preserving map $U : \Omega \to \Delta$ such that $S \circ U(x) = U \circ T(x)$ for all $x \in \Omega$.
Examples of factors are the system itself or the trivial system $(\Omega, S(x) = x, \mu)$.
If $S$ is a factor of $T$ and $T$ is a factor of $S$, then the two systems are called
isomorphic. Verify that every factor of a dynamical system $(\Omega, T, \mathcal{A}, \mu)$ can
be realized as $(\Omega, T, \mathcal{B}, \mu)$ where $\mathcal{B}$ is a $\sigma$-subalgebra of $\mathcal{A}$.

c) It is known that if a measure preserving transformation $T$ on a probability
space has pure point spectrum, then the system is isomorphic to a
translation on the compact Abelian group $\hat{G}$ which is the dual group of the
discrete group $G$ formed by the spectrum $\sigma(T) \subset \mathbb{T}$. Describe the possible
factors of $T$ and their spectra.

Exercise. Let $\Omega = \mathbb{T}^1$ be the one-dimensional circle. Let $\mathcal{A}$ be the Borel
$\sigma$-algebra on $\mathbb{T}^1 = \mathbb{R}/(2\pi\mathbb{Z})$ and $P = dx$ the Lebesgue measure. Given $k \in \mathbb{N}$,
denote by $\mathcal{B}_k$ the $\sigma$-algebra consisting of all $A \in \mathcal{A}$ such that
$A + \frac{n 2\pi}{k} = A \pmod{2\pi}$ for all $1 \le n \le k$. What is the conditional expectation
$E[X|\mathcal{B}_k]$ for a random variable $X \in \mathcal{L}^1$?



3.2 Martingales 

It is typical in probability theory that one considers several $\sigma$-algebras on
a probability space $(\Omega, \mathcal{A}, P)$. These algebras are often defined by a set of
random variables, especially in the case of stochastic processes. Martingales
are discrete stochastic processes which generalize the process of summing
up IID random variables. They are a powerful tool with many applications. In
this section we largely follow [113].

Definition. A sequence $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$ of sub $\sigma$-algebras of $\mathcal{A}$ is called a
filtration, if $\mathcal{A}_0 \subset \mathcal{A}_1 \subset \cdots \subset \mathcal{A}$. Given a filtration $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$, one calls
$(\Omega, \mathcal{A}, \{\mathcal{A}_n\}_{n \in \mathbb{N}}, P)$ a filtered space.

Example. If $\Omega = \{0, 1\}^{\mathbb{N}}$ is the space of all 0-1 sequences with the Borel
$\sigma$-algebra generated by the product topology and $\mathcal{A}_n$ is the finite
$\sigma$-algebra generated by the $2^n$ cylinder sets
$A = \{x_1 = a_1, \dots, x_n = a_n\}$ with $a_i \in \{0, 1\}$,
then $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$ defines a filtered space.

Definition. A sequence $X = \{X_n\}_{n \in \mathbb{N}}$ of random variables is called a
discrete stochastic process or simply process. It is a $\mathcal{L}^p$-process, if each $X_n$
is in $\mathcal{L}^p$. A process is called adapted to the filtration $\{\mathcal{A}_n\}$ if $X_n$ is
$\mathcal{A}_n$-measurable for all $n \in \mathbb{N}$.

Example. For $\Omega = \{0, 1\}^{\mathbb{N}}$ as above, the process $X_n(x) = \prod_{i=1}^n x_i$ is
a stochastic process adapted to the filtration. Also $S_n(x) = \sum_{i=1}^n x_i$ is
adapted to the filtration.

Definition. A $\mathcal{L}^1$-process which is adapted to a filtration $\{\mathcal{A}_n\}$ is called a
martingale if

$$ E[X_n|\mathcal{A}_{n-1}] = X_{n-1} $$

for all $n \ge 1$. It is called a supermartingale if $E[X_n|\mathcal{A}_{n-1}] \le X_{n-1}$ and a
submartingale if $E[X_n|\mathcal{A}_{n-1}] \ge X_{n-1}$. If we mean either submartingale or
supermartingale (or martingale) we speak of a semimartingale.



Remark. It immediately follows that for a martingale

$$ E[X_n|\mathcal{A}_m] = X_m $$

if $m < n$ and that $E[X_n]$ is constant. Allan Gut mentions in [35] that a
martingale is an allegory for "life" itself: the expected state of the future
given the past history is equal to the present state and on average, nothing
happens.




Figure. A random variable $X$ on the unit square defines a gray scale picture
if we interpret $X(x, y)$ as the gray value at the point $(x, y)$. It shows Joseph
Leo Doob (1910-2004), who developed basic martingale theory and many
applications. The partitions $\mathcal{A}_n = \{ [k/2^n, (k+1)/2^n) \times [j/2^n, (j+1)/2^n) \}$
define a filtration of $\Omega = [0, 1] \times [0, 1]$. The sequence of pictures shows the
conditional expectations $E[X|\mathcal{A}_n]$. It is a martingale.

Exercise. Determine from the following sequence of pictures whether it is a
supermartingale or a submartingale. The images get brighter and brighter
on average as the resolution becomes better.






Definition. If a martingale $X_n$ is given with respect to a filtered space
$\mathcal{A}_n = \sigma(Y_0, \dots, Y_n)$, where $Y_n$ is a given process, $X$ is called a martingale
with respect to $Y$.

Remark. The word "martingale" means a gambling system in which losing
bets are doubled. It is also the name of a part of a horse's harness or a belt
on the back of a man's coat.

Remark. If $X$ is a supermartingale, then $-X$ is a submartingale and vice
versa. A supermartingale which is also a submartingale is a martingale.
Since we can change $X$ to $X - X_0$ without destroying any of the martingale
properties, we could assume the process is null at $0$, which means $X_0 = 0$.



Exercise. a) Verify that if $X_n, Y_n$ are two submartingales, then $\sup(X, Y)$
is a submartingale.

b) If $X_n$ is a submartingale, then $E[X_n] \ge E[X_{n-1}]$.

c) If $X_n$ is a martingale, then $E[X_n] = E[X_{n-1}]$.



Remark. Given a martingale. From the tower property of conditional
expectation follows that for $m < n$

$$ E[X_n|\mathcal{A}_m] = E[E[X_n|\mathcal{A}_{n-1}]|\mathcal{A}_m] = E[X_{n-1}|\mathcal{A}_m] = \cdots = X_m . $$



Example. Sum of independent random variables 

Let $X_i \in \mathcal{L}^1$ be a sequence of independent random variables with mean
$E[X_i] = 0$. Define $S_0 = 0$, $S_n = \sum_{k=1}^n X_k$ and $\mathcal{A}_n = \sigma(X_1, \dots, X_n)$ with
$\mathcal{A}_0 = \{\emptyset, \Omega\}$. Then $S_n$ is a martingale since $S_n$ is an $\{\mathcal{A}_n\}$-adapted
$\mathcal{L}^1$-process and

$$ E[S_n|\mathcal{A}_{n-1}] = E[S_{n-1}|\mathcal{A}_{n-1}] + E[X_n|\mathcal{A}_{n-1}] = S_{n-1} + E[X_n] = S_{n-1} . $$

We have used linearity and the independence property of the conditional
expectation.

Example. Conditional expectation 

Given a random variable $X \in \mathcal{L}^1$ on a filtered space $(\Omega, \mathcal{A}, \{\mathcal{A}_n\}_{n \in \mathbb{N}}, P)$.
Then $X_n = E[X|\mathcal{A}_n]$ is a martingale.

Especially: given a sequence $Y_n$ of random variables. Then $\mathcal{A}_n = \sigma(Y_0, \dots, Y_n)$
is a filtered space and $X_n = E[X|Y_0, \dots, Y_n]$ is a martingale. Proof: by the
tower property

$$ E[X_n|\mathcal{A}_{n-1}] = E[X_n|Y_0, \dots, Y_{n-1}] $$
$$ = E[E[X|Y_0, \dots, Y_n]|Y_0, \dots, Y_{n-1}] $$
$$ = E[X|Y_0, \dots, Y_{n-1}] = X_{n-1} , $$

verifying the martingale property $E[X_n|\mathcal{A}_{n-1}] = X_{n-1}$.
We say $X$ is a martingale with respect to $Y$. Note that because $X_n$ is by
definition $\sigma(Y_0, \dots, Y_n)$-measurable, there exist Borel measurable functions
$h_n : \mathbb{R}^{n+1} \to \mathbb{R}$ such that $X_n = h_n(Y_0, \dots, Y_n)$.

Example. Product of positive variables

Given a sequence $Y_n$ of independent random variables $Y_n \ge 0$ satisfying
$E[Y_n] = 1$. Define $X_0 = 1$ and $X_n = \prod_{i=1}^n Y_i$ and $\mathcal{A}_n = \sigma(Y_1, \dots, Y_n)$.
Then $X_n$ is a martingale. This is an exercise. Note that the martingale
property does not follow directly by taking logarithms.

Example. Product of matrix-valued random variables

Given a sequence of independent random variables $Z_n$ with values in the
group $GL(N, \mathbb{R})$ of invertible $N \times N$ matrices and let $\mathcal{A}_n = \sigma(Z_1, \dots, Z_n)$.
Assume $E[\log \|Z_n\|] \le 0$, where $\|Z_n\|$ denotes the norm of the matrix (the
square root of the maximal eigenvalue of $Z_n \cdot Z_n^*$, where $Z_n^*$ is the adjoint).
Define the real-valued random variables $X_n = \log \|Z_1 \cdot Z_2 \cdots Z_n\|$, where $\cdot$
denotes matrix multiplication. Because $X_n \le \log \|Z_n\| + X_{n-1}$, we get

$$ E[X_n|\mathcal{A}_{n-1}] \le E[\log \|Z_n\| \mid \mathcal{A}_{n-1}] + E[X_{n-1}|\mathcal{A}_{n-1}] $$
$$ = E[\log \|Z_n\|] + X_{n-1} \le X_{n-1} $$

so that $X_n$ is a supermartingale. In ergodic theory, such a matrix-valued
process $X_n$ is called sub-additive.



Example. If $Z_n$ is a sequence of matrix valued random variables, we can
also look at the sequence of random variables $Y_n = \|Z_1 \cdot Z_2 \cdots Z_n\|$. If
$E[\|Z_n\|] = 1$, then $Y_n$ is a supermartingale.



Example. Polya's urn scheme 

An urn contains initially a red and a black ball. At each time $n \ge 1$, a
ball is taken randomly, its color noted, and both this ball and another
ball of the same color are placed back into the urn. Like this, after $n$
draws, the urn contains $n + 2$ balls. Define $Y_n$ as the number of black balls
after $n$ moves and $X_n = Y_n/(n + 2)$, the fraction of black balls. We claim
that $X$ is a martingale with respect to $Y$: the random variables $Y_n$ take
values in $\{1, \dots, n + 1\}$. Clearly $P[Y_{n+1} = k + 1|Y_n = k] = k/(n + 2)$ and
$P[Y_{n+1} = k|Y_n = k] = 1 - k/(n + 2)$. Therefore, with $k = Y_n$,

$$ E[X_{n+1}|Y_1, \dots, Y_n] = \frac{1}{n+3} E[Y_{n+1}|Y_1, \dots, Y_n] $$
$$ = \frac{1}{n+3} \left[ P[Y_{n+1} = k+1|Y_n = k] \cdot (Y_n + 1) + P[Y_{n+1} = k|Y_n = k] \cdot Y_n \right] $$
$$ = \frac{1}{n+3} \left[ (Y_n + 1) \frac{Y_n}{n+2} + Y_n \left( 1 - \frac{Y_n}{n+2} \right) \right] $$
$$ = \frac{Y_n}{n+2} = X_n . $$

Note that $X_n$ is not independent of $X_{n-1}$. The process "learns" in the sense
that if there are more black balls, then the winning chances are better.
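The martingale property of Polya's urn can be confirmed by exact computation. The following sketch propagates the law of $Y_n$ step by step and checks $E[X_{n+1}|Y_n = k] = k/(n+2)$ for every reachable state; the function names are hypothetical.

```python
from fractions import Fraction

def polya_step(dist, n):
    """Advance the law of Y_n (black balls after n draws) to the law of Y_{n+1}."""
    new = {}
    for k, p in dist.items():
        up = Fraction(k, n + 2)                       # P[Y_{n+1} = k+1 | Y_n = k]
        new[k + 1] = new.get(k + 1, 0) + p * up
        new[k] = new.get(k, 0) + p * (1 - up)
    return new

def one_step_expectation(k, n):
    """E[X_{n+1} | Y_n = k] for the fraction X_n = Y_n / (n + 2)."""
    up = Fraction(k, n + 2)
    return (up * (k + 1) + (1 - up) * k) / (n + 3)

dist = {1: Fraction(1)}                               # Y_0 = 1: one black ball
martingale_ok = True
for n in range(5):
    # martingale property, checked state by state
    martingale_ok = martingale_ok and all(
        one_step_expectation(k, n) == Fraction(k, n + 2) for k in dist)
    dist = polya_step(dist, n)
print(martingale_ok, dist)
```

As a by-product one sees that $Y_n$ is uniformly distributed on $\{1, \dots, n+1\}$, so that $E[X_n] = 1/2$ for all $n$.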
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Figure. A typical run of 30 
experiments with Polya's urn 
scheme. 
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Example. Branching processes 

Let $Z_{ni}$ be IID, nonnegative integer-valued random variables with positive
finite mean $m$. Define $Y_0 = 1$ and

$$ Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk} $$

with the convention that for $Y_n = 0$, the sum is zero. We claim that
$X_n = Y_n/m^n$ is a martingale with respect to $Y$. By the independence of $Y_n$ and
$Z_{ni}$, $i \ge 1$, we have for every $n$

$$ E[Y_{n+1}|Y_0, \dots, Y_n] = E\Big[ \sum_{k=1}^{Y_n} Z_{nk} \,\Big|\, Y_0, \dots, Y_n \Big] = \sum_{k=1}^{Y_n} E[Z_{nk}] = m Y_n $$

so that

$$ E[X_{n+1}|Y_0, \dots, Y_n] = E[Y_{n+1}|Y_0, \dots, Y_n]/m^{n+1} = m Y_n/m^{n+1} = X_n . $$

The branching process can be used to model population growth, disease
epidemics or nuclear reactions. In the first case, think of $Y_n$ as the size of a
population at time $n$ and of $Z_{ni}$ as the number of progenies of the $i$-th
member of the population in the $n$-th generation.
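Since $X_n = Y_n/m^n$ is a martingale with $X_0 = 1$, one must have $E[Y_n] = m^n$. The sketch below checks this exactly for a small hypothetical offspring law on $\{0, 1, 2\}$ by computing the law of $Y_n$ with convolutions; the offspring probabilities are an assumption made only for illustration.

```python
from fractions import Fraction

# Assumption: hypothetical offspring law P[Z=0], P[Z=1], P[Z=2] with mean m = 5/4.
offspring = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
m = sum(z * p for z, p in offspring.items())

def convolve(a, b):
    """Law of the sum of two independent integer-valued random variables."""
    out = {}
    for x, p in a.items():
        for y, q in b.items():
            out[x + y] = out.get(x + y, 0) + p * q
    return out

def next_generation(dist):
    """Law of Y_{n+1} = Z_{n1} + ... + Z_{n,Y_n}, given the law of Y_n."""
    out = {}
    power = {0: Fraction(1)}              # law of the empty sum (Y_n = 0)
    for size in range(max(dist) + 1):
        if size in dist:                  # condition on Y_n = size
            for total, q in power.items():
                out[total] = out.get(total, 0) + dist[size] * q
        power = convolve(power, offspring)   # law of a sum of size + 1 offspring counts
    return out

dist = {1: Fraction(1)}                   # Y_0 = 1
means = []
for n in range(5):
    means.append(sum(k * p for k, p in dist.items()))   # E[Y_n]
    dist = next_generation(dist)
print(means)
```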



Figure. A typical growth of $Y_n$ of a branching process. In this example,
the random variables $Z_{ni}$ had a Poisson distribution with mean $m = 1.1$.
It is possible that the process dies out, but often, it grows exponentially.

Proposition 3.2.1. Let $\mathcal{A}_n$ be a fixed filtered sequence of $\sigma$-algebras.
Linear combinations of martingales over $\mathcal{A}_n$ are again martingales over $\mathcal{A}_n$.
Submartingales and supermartingales form cones: if for example $X, Y$ are
submartingales and $a, b \ge 0$, then $aX + bY$ is a submartingale.



Proof. Use the linearity and positivity of the conditional expectation. □ 



Proposition 3.2.2. a) If $X$ is a martingale and $u$ is convex such that
$u(X_n) \in \mathcal{L}^1$, then $Y = u(X)$ is a submartingale. Especially, if $X$ is a martingale,
then $|X|$ is a submartingale.

b) If $u$ is monotone and convex and $X$ is a submartingale such that
$u(X_n) \in \mathcal{L}^1$, then $u(X)$ is a submartingale.



Proof. a) We have by the conditional Jensen property (3.1.4)

$$ Y_n = u(X_n) = u(E[X_{n+1}|\mathcal{A}_n]) \le E[u(X_{n+1})|\mathcal{A}_n] = E[Y_{n+1}|\mathcal{A}_n] . $$

b) Use the conditional Jensen property again and the monotonicity of $u$ to
get

$$ Y_n = u(X_n) \le u(E[X_{n+1}|\mathcal{A}_n]) \le E[u(X_{n+1})|\mathcal{A}_n] = E[Y_{n+1}|\mathcal{A}_n] . $$

□

Definition. A stochastic process $C = \{C_n\}_{n \ge 1}$ is called previsible if $C_n$ is
$\mathcal{A}_{n-1}$-measurable. A process $X$ is called bounded, if $X_n \in \mathcal{L}^\infty$ and if there
exists $K \in \mathbb{R}$ such that $\|X_n\|_\infty \le K$ for all $n \in \mathbb{N}$.

Previsible processes can only see the past and not see the future. In some
sense we can predict them.

Definition. Given a semimartingale $X$ and a previsible process $C$, the process

$$ \Big( \int C \, dX \Big)_n = \sum_{k=1}^n C_k (X_k - X_{k-1}) $$

is called a discrete stochastic integral or a martingale transform.



Theorem 3.2.3 (The system can't be beaten). If $C$ is a bounded nonnegative
previsible process and $X$ is a supermartingale then $\int C \, dX$ is a
supermartingale. The same statement is true for submartingales and martingales.

Proof. Let $Y = \int C \, dX$. From the property of "extracting knowledge" in
theorem (3.1.4), we get

$$ E[Y_n - Y_{n-1}|\mathcal{A}_{n-1}] = E[C_n (X_n - X_{n-1})|\mathcal{A}_{n-1}] = C_n \cdot E[X_n - X_{n-1}|\mathcal{A}_{n-1}] \le 0 $$

because $C_n$ is nonnegative and $X_n$ is a supermartingale. □

Remark. If one wants to relax the boundedness of $C$, then one has to
strengthen the condition for $X$. The theorem stays true, if both $C$ and
$X$ are $\mathcal{L}^2$-processes.

Remark. Here is an interpretation: if $X_n$ represents your capital in a game,
then $X_n - X_{n-1}$ are the net winnings per unit stake. If $C_n$ is the stake on
game $n$, then

$$ \Big( \int C \, dX \Big)_n = \sum_{k=1}^n C_k (X_k - X_{k-1}) $$

are the total winnings up to time $n$. A martingale represents a fair game
since $E[X_n - X_{n-1}|\mathcal{A}_{n-1}] = 0$, whereas a supermartingale is a game which
is unfavorable to you. The above theorem tells that you can not find a
strategy for putting your stake to make the game fair.
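The statement that no bounded previsible staking rule changes a fair game can be illustrated by brute force. The sketch enumerates all fair-coin paths of a fixed length and computes the expected total winnings of a martingale transform; the staking rule `stake_after_loss` is a hypothetical example, chosen only because it depends on the past.

```python
from fractions import Fraction
from itertools import product

def transform(increments, stake):
    """(int C dX)_n = sum_k C_k (X_k - X_{k-1}) along one path of increments."""
    total, history = 0, []
    for step in increments:
        total += stake(history) * step    # C_k may only look at the past
        history.append(step)
    return total

def stake_after_loss(history):
    """Hypothetical bounded previsible rule: bet 3 units right after a loss, else 1."""
    return 3 if history and history[-1] == -1 else 1

n = 8
paths = list(product((-1, 1), repeat=n))  # all equally likely paths of a fair coin
expected = sum(Fraction(transform(p, stake_after_loss)) for p in paths) / len(paths)
print(expected)   # the expected total winnings vanish
```

Replacing `stake_after_loss` by any other function of the history gives the same expected value $0$, as long as the stake for round $k$ does not use the outcome of round $k$ itself.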



Figure. In this example, the increments $X_n - X_{n-1}$ are $\pm 1$ with
probability $1/2$ and $C_n = 1$ if $X_{n-1}$ is even and $C_n = 0$ if $X_{n-1}$ is odd.
The original process $X_n$ is a symmetric random walk and so a martingale.
The new process $\int C \, dX$ is again a martingale.




Exercise. a) Let $Y_1, Y_2, \dots$ be a sequence of independent non-negative random
variables satisfying $E[Y_k] = 1$ for all $k \in \mathbb{N}$. Define $X_0 = 1$,
$X_n = Y_1 \cdots Y_n$ and $\mathcal{A}_n = \sigma(Y_1, Y_2, \dots, Y_n)$. Show that $X_n$ is a martingale.
b) Let $Z_n$ be a sequence of independent random variables taking values in
the set of $n \times n$ matrices satisfying $E[\|Z_n\|] = 1$. Define $X_0 = 1$,
$X_n = \|Z_1 \cdots Z_n\|$. Show that $X_n$ is a supermartingale.



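Definition. A random variable $T$ with values in $\overline{\mathbb{N}} = \mathbb{N} \cup \{\infty\}$ is called
a random time. Define $\mathcal{A}_\infty = \sigma(\bigcup_{n \ge 0} \mathcal{A}_n)$. A random time $T$ is called a
stopping time with respect to a filtration $\mathcal{A}_n$, if $\{T \le n\} \in \mathcal{A}_n$ for all
$n \in \overline{\mathbb{N}}$.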









Remark. A random time $T$ is a stopping time if and only if $\{T = n\} \in \mathcal{A}_n$
for all $n \in \mathbb{N}$ since $\{T \le n\} = \bigcup_{0 \le k \le n} \{T = k\} \in \mathcal{A}_n$.

Remark. Here is an interpretation: stopping times are random times whose
occurrence can be determined without pre-knowledge of the future. The
term comes from gambling. A gambler is forced to stop playing if his capital
is zero. Whether or not you stop after the $n$-th game depends only on the
history up to and including the time $n$.

Example. First entry time. 

Let $X_n$ be a $\mathcal{A}_n$-adapted process and given a Borel set $B \in \mathcal{B}$ in $\mathbb{R}^d$. Define

$$ T(\omega) = \inf \{ n \ge 0 \mid X_n(\omega) \in B \} $$

which is the time of first entry of $X_n$ into $B$. The set $\{T = \infty\}$ is the set
of paths which never enter $B$. Obviously

$$ \{T \le n\} = \bigcup_{k=0}^n \{X_k \in B\} \in \mathcal{A}_n $$

so that $T$ is a stopping time.

Example. "Continuous Black-Jack": let $X_i$ be IID random variables with
uniform distribution in $[0, 1]$. Define $S_n = \sum_{k=1}^n X_k$ and let $T(\omega)$ be the
smallest integer $n$ so that $S_n(\omega) > 1$. This is a stopping time. A popular
problem asks for the expectation of this random variable $T$: How many
"cards" $X_i$ do we have to draw until we get busted and the sum is larger
than 1? We obviously have $P[T = 1] = 0$. Now, $P[T = 2] = P[X_2 > 1 - X_1]$
is the area of the region $\{ (x, y) \in [0, 1] \times [0, 1] \mid y > 1 - x \}$ which is $1/2$.
In general, $P[T > k] = P[S_k \le 1]$ is the volume of the simplex
$\{ x \in [0, 1]^k \mid x_1 + \cdots + x_k \le 1 \}$ which is $1/k!$, so that
$P[T = k] = 1/(k-1)! - 1/k! = (k-1)/k!$; for example $P[T = 3] = 1/2 - 1/6 = 1/3$.
The expectation of $T$ is $E[T] = \sum_{k=0}^\infty P[T > k] = \sum_{k=0}^\infty 1/k! = e$.
This means that if we play Black-Jack with uniformly
distributed random variables and threshold 1, we expect to get busted in
more than 2, but less than 3 "cards".
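The value $E[T] = e$ can be reproduced with exact arithmetic. The sketch uses the fact that $P[T > k] = P[S_k \le 1]$ is the volume $1/k!$ of the simplex $\{x_1 + \cdots + x_k \le 1\}$ in the unit cube, from which $P[T = k] = P[T > k-1] - P[T > k]$; the function names are hypothetical.

```python
import math
from fractions import Fraction

def prob_T_greater(k):
    """P[T > k] = P[S_k <= 1] = 1/k!, the volume of the standard k-simplex."""
    return Fraction(1, math.factorial(k))

def prob_T_equals(k):
    """P[T = k] = P[T > k-1] - P[T > k] = (k-1)/k! for k >= 1."""
    return prob_T_greater(k - 1) - prob_T_greater(k)

terms = 20
total_mass = sum(prob_T_equals(k) for k in range(1, terms + 1))
expectation = sum(k * prob_T_equals(k) for k in range(1, terms + 1))
print(float(total_mass), float(expectation), math.e)
```

The probabilities sum to $1$ up to the tiny remainder $1/20!$, and the truncated expectation agrees with $e$ to machine precision.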

Example. Last exit time. 

Assume the same setup as in the first entry time example. But this time

$$ T(\omega) = \sup \{ n \ge 0 \mid X_n(\omega) \in B \} $$

is not a stopping time since it is impossible to know whether $X$ will return to
$B$ after some time $k$ without knowing the whole future.



Proposition 3.2.4. Let $T_1, T_2$ be two stopping times. The infimum $T_1 \wedge T_2$,
the maximum $T_1 \vee T_2$ as well as the sum $T_1 + T_2$ are stopping times.






Proof. This is obvious from the definition because $\mathcal{A}_n$-measurable functions
are closed under taking minima, maxima and sums. □

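Definition. Given a stochastic process $X_n$ which is adapted to a filtration
$\mathcal{A}_n$ and let $T$ be a stopping time with respect to $\mathcal{A}_n$, define the random
variable

$$ X_T(\omega) = X_{T(\omega)}(\omega) \ \text{ if } T(\omega) < \infty, \qquad X_T(\omega) = 0 \ \text{ else, } $$

or equivalently $X_T = \sum_{n=0}^\infty X_n 1_{\{T = n\}}$. The process $X^T_n = X_{T \wedge n}$ is called
the stopped process. It is equal to $X_T$ for times $T \le n$ and equal to $X_n$ if
$T \ge n$.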



Proposition 3.2.5. If $X$ is a supermartingale and $T$ is a stopping time, then
the stopped process $X^T$ is a supermartingale. In particular $E[X^T_n] \le E[X_0]$
for all $n$. The same statement is true if supermartingale is replaced by
martingale in which case $E[X^T_n] = E[X_0]$.



Proof. Define the "stake process" $C^{(T)}$ by $C^{(T)}_n = 1_{\{T \ge n\}}$. You can think of
it as betting 1 unit and quitting immediately after time $T$. Define then
the "winning process"

$$ \Big( \int C^{(T)} \, dX \Big)_n = \sum_{k=1}^n C^{(T)}_k (X_k - X_{k-1}) = X_{T \wedge n} - X_0 , $$

or shortly $\int C^{(T)} \, dX = X^T - X_0$. The process $C^{(T)}$ is previsible, since it can
only take values $0$ and $1$ and $\{C^{(T)}_n = 0\} = \{T \le n - 1\} \in \mathcal{A}_{n-1}$. The
claim follows from the "system can't be beaten" theorem. □

Remark. It is important that we take the stopped process $X^T$ and not the
random variable $X_T$: for the random walk $X$ on $\mathbb{Z}$ starting at $0$, let $T$ be
the stopping time $T = \inf \{ n \mid X_n = 1 \}$. This is the martingale strategy in
the casino which gave the name to these processes. As we will see later on,
the random walk is recurrent in one dimension, so that $P[T < \infty] = 1$. However

$$ 1 = E[X_T] \ne E[X_0] = 0 . $$

The above proposition only gives $E[X^T_n] = E[X_{T \wedge n}] = E[X_0]$ for every $n$.
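The difference between the stopped process and the random variable $X_T$ can be made concrete. The sketch computes the exact law of $X_{T \wedge n}$ for the simple random walk with $T$ the first hitting time of $1$: the mean stays $0$ for every $n$, although $P[T \le n]$ tends to $1$ and $X_T = 1$. The function name is hypothetical.

```python
from fractions import Fraction

def stopped_walk_law(n, level=1):
    """Exact law of X_{T ^ n} for the simple random walk, T = first hit of `level`."""
    law = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for x, p in law.items():
            if x == level:                            # already stopped: stay put
                new[x] = new.get(x, 0) + p
            else:                                     # still playing: fair +-1 step
                for y in (x - 1, x + 1):
                    new[y] = new.get(y, 0) + p / 2
        law = new
    return law

for n in (1, 10, 100):
    law = stopped_walk_law(n)
    mean = sum(x * p for x, p in law.items())         # E[X_{T ^ n}]
    print(n, mean, float(law.get(1, 0)))              # mean stays 0, P[T <= n] grows
```

The small probability of not yet having reached $1$ is compensated by large negative values of the walk, which is why the expectation of the stopped process does not move.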

When can we say $E[X_T] = E[X_0]$? The answer is given by Doob's optimal
stopping time theorem:

Theorem 3.2.6 (Doob's optimal stopping time theorem). Let $X$ be a
supermartingale and $T$ be a stopping time. If one of the five following
conditions is true:

(i) $T$ is bounded.

(ii) $X$ is bounded and $T$ is almost everywhere finite.

(iii) $T \in \mathcal{L}^1$ and $|X_n - X_{n-1}| \le K$ for some $K > 0$.

(iv) $X_T \in \mathcal{L}^1$ and $\lim_{k \to \infty} E[X_k; \{T > k\}] = 0$.

(v) $X$ is uniformly integrable and $T$ is almost everywhere finite.

then $E[X_T] \le E[X_0]$.

If $X$ is a martingale and any of the five conditions is true, then
$E[X_T] = E[X_0]$.


Proof. We know that $E[X_{T \wedge n} - X_0] \le 0$ because $X$ is a supermartingale.

(i) Because $T$ is bounded, we can take $n = \sup_\omega T(\omega) < \infty$ and get

$$ E[X_T - X_0] = E[X_{T \wedge n} - X_0] \le 0 . $$

(ii) Use the dominated convergence theorem (2.4.3) to get

$$ E[X_T - X_0] = \lim_{n \to \infty} E[X_{T \wedge n} - X_0] \le 0 . $$

(iii) We estimate

$$ |X_{T \wedge n} - X_0| = \Big| \sum_{k=1}^{T \wedge n} (X_k - X_{k-1}) \Big| \le \sum_{k=1}^{T \wedge n} |X_k - X_{k-1}| \le T K . $$

Because $T \in \mathcal{L}^1$, the result follows from the dominated convergence
theorem (2.4.3). Since for each $n$ we have $E[X_{T \wedge n} - X_0] \le 0$, this remains true
in the limit $n \to \infty$.

(iv) By (i), we get $E[X_0] \ge E[X_{T \wedge k}] = E[X_T; \{T \le k\}] + E[X_k; \{T > k\}]$
and taking the limit gives $E[X_0] \ge \lim_{k \to \infty} E[X_T; \{T \le k\}] = E[X_T]$ by
the dominated convergence theorem (2.4.3) and the assumption.

(v) The uniform integrability $E[|X_n|; |X_n| > R] \to 0$ for $R \to \infty$ assures
that $X_T \in \mathcal{L}^1$ since $E[|X_T|] \le k \cdot \max_{1 \le i \le k} E[|X_i|] + \sup_n E[|X_n|; \{T > k\}] < \infty$.
Since $|E[X_k; \{T > k\}]| \le \sup_n E[|X_n|; \{T > k\}] \to 0$, we can
apply (iv).

If $X$ is a martingale, we use the supermartingale case for both $X$ and
$-X$. □



Remark. The interpretation of this result is that a fair game cannot be
made unfair by sampling it with bounded stopping times.

Theorem 3.2.7 (No winning strategy). Assume $X$ is a martingale and suppose
$|X_n - X_{n-1}|$ is bounded. Given a previsible process $C$ which is bounded
and let $T \in \mathcal{L}^1$ be a stopping time, then $E[(\int C \, dX)_T] = 0$.



Proof. We know that $\int C \, dX$ is a martingale and since $(\int C \, dX)_0 = 0$, the
claim follows from the optimal stopping time theorem part (iii). □

Remark. The martingale strategy mentioned in the introduction shows
that for unbounded stopping times, there is a winning strategy. With the
martingale strategy one has T = n with probability $1/2^n$. The player always
wins, she just has to double the bet until the coin changes sign. But it
assumes an "infinitely thick wallet". With a finite but large initial capital,
there is a very small risk of losing, but then the loss is large. You see that in
the real world: players with large capital in the stock market mostly win,
but if they lose, their loss can be huge.
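The trade-off in this remark can be made concrete with a small exact
computation. The following sketch assumes a player who starts with stake 1,
doubles the stake after every lost fair coin flip and can afford at most
`max_rounds` bets; the function name and the parameter are illustrative
assumptions standing in for the finite wallet, they do not come from the text.

```python
from fractions import Fraction

def doubling_outcome(max_rounds):
    """Exact win probability, possible loss and expected gain of the doubling
    strategy when at most max_rounds bets (stakes 1, 2, 4, ...) are affordable."""
    half = Fraction(1, 2)
    p_win = Fraction(0)
    for n in range(1, max_rounds + 1):
        # the first win occurs at bet n with probability 1/2^n, net gain +1
        p_win += half ** n
    p_ruin = half ** max_rounds      # every one of the max_rounds bets is lost
    loss = 2 ** max_rounds - 1       # 1 + 2 + ... + 2^(max_rounds - 1)
    expected_gain = p_win * 1 - p_ruin * loss
    return p_win, loss, expected_gain

print(doubling_outcome(10))
```

Under these assumptions the small gain is very likely and the rare loss is
large, while the expected gain stays zero, as the no winning strategy theorem
predicts for a bounded stopping time.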

Martingales can be characterized involving stopping times: 



Theorem 3.2.8 (Komatsu's lemma). Let X be an $\mathcal{A}_n$-adapted sequence of
random variables in $\mathcal{L}^1$ such that for every bounded stopping time T

$$E[X_T] = E[X_0] ,$$

then X is a martingale with respect to $\mathcal{A}_n$.

Proof. Fix $n \in N$ and $A \in \mathcal{A}_n$. The map

$$T(\omega) = \begin{cases} n, & \omega \in A \\ n+1, & \omega \notin A \end{cases}$$

is a stopping time because $\sigma(T) = \{\emptyset, A, A^c, \Omega\} \subset \mathcal{A}_n$. Apply
$E[X_T] = E[X_0]$ and $E[X_{T'}] = E[X_0]$ for the bounded constant stopping
time $T' = n + 1$ to get

$$E[X_n; A] + E[X_{n+1}; A^c] = E[X_T] = E[X_0] = E[X_{T'}] = E[X_{n+1}]
  = E[X_{n+1}; A] + E[X_{n+1}; A^c]$$

so that $E[X_{n+1}; A] = E[X_n; A]$. Since this is true for any $A \in \mathcal{A}_n$, we know
that $E[X_{n+1}|\mathcal{A}_n] = E[X_n|\mathcal{A}_n] = X_n$ and X is a martingale. □

Example. The gambler's ruin problem is the following question: Let $Y_i$ be
IID with $P[Y_i = \pm 1] = 1/2$ and let $X_n = \sum_{k=1}^n Y_k$ be the random walk
with $X_0 = 0$. We know that X is a martingale with respect to Y. Given
a, b > 0, we define the stopping time

$$T = \min\{n \ge 0 \;|\; X_n = b, \text{ or } X_n = -a \} .$$

We want to compute $P[X_T = -a]$ and $P[X_T = b]$ in dependence of a, b.



Figure. Three samples of a process $X_n$ starting at $X_0 = 0$. The process is
stopped with the stopping time T, when $X_n$ hits the lower bound -a or the
upper bound b. If $X_n$ is the winning of a first gambler, which is the loss
of a second gambler, then T is the time at which one of the gamblers is
broke. The initial capital of the first gambler is a, the initial capital
of the second gambler is b.



Remark. Suppose $Y_i$ are the outcomes of a series of fair gambles between two
players A and B and the random variables $X_n$ are the net change in the
fortune of the gamblers after n independent games. If at the beginning, A
has fortune a and B has fortune b, then $P[X_T = -a]$ is the ruin probability
of A and $P[X_T = b]$ is the ruin probability of B.



Proposition 3.2.9.

$$P[X_T = -a] = 1 - P[X_T = b] = \frac{b}{a + b} .$$



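Proof. T is finite almost everywhere. One can see this by the law of the
iterated logarithm,

$$\limsup_n \frac{X_n}{\Lambda_n} = 1 , \qquad \liminf_n \frac{X_n}{\Lambda_n} = -1 ,$$

where $\Lambda_n = \sqrt{2 n \log\log n}$.
(We will give later a direct proof of the finiteness of T, when we treat the
random walk in more detail.) It follows that $P[X_T = -a] = 1 - P[X_T = b]$.
We check that X satisfies condition (iv) in Doob's stopping time theorem:
since $X_T$ takes values in $\{-a, b\}$, it is in $\mathcal{L}^1$ and because on the set
$\{T > k\}$, the value of $X_k$ is in (-a, b), we have
$|E[X_k; \{T > k\}]| \le \max\{a, b\} P[T > k] \to 0$. Doob's stopping time theorem
therefore gives $0 = E[X_0] = E[X_T] = -a P[X_T = -a] + b P[X_T = b]$. Together
with $P[X_T = -a] + P[X_T = b] = 1$ this yields $P[X_T = -a] = b/(a+b)$. □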
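The value b/(a+b) can be checked numerically without any martingale theory.
The following sketch propagates the law of the fair random walk step by step
and collects the mass absorbed at -a and at b. The function name and the
truncation parameter `steps` are assumptions made for this illustration.

```python
def ruin_probabilities(a, b, steps=5000):
    """Return approximations of (P[X_T = -a], P[X_T = b]) for the fair
    random walk with steps +1 or -1 started at 0."""
    size = a + b + 1                  # positions -a, ..., b live at index 0..a+b
    mass = [0.0] * size
    mass[a] = 1.0                     # X_0 = 0 sits at index a
    hit_low = hit_high = 0.0
    for _ in range(steps):
        new = [0.0] * size
        for i in range(1, size - 1):  # only interior positions still move
            new[i - 1] += 0.5 * mass[i]
            new[i + 1] += 0.5 * mass[i]
        hit_low += new[0]
        hit_high += new[-1]
        new[0] = new[-1] = 0.0        # absorbed mass leaves the walk
        mass = new
    return hit_low, hit_high

low, high = ruin_probabilities(3, 7)
print(low, high)   # compare with b/(a+b) = 0.7 and a/(a+b) = 0.3
```

The mass that is still unabsorbed after `steps` iterations decays
geometrically, which is another way to see that T is finite almost everywhere.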



Remark. The boundedness of T is necessary in Doob's stopping time theorem.
Let $T = \inf\{n \;|\; X_n = 1\}$. Then $E[X_T] = 1$ but $E[X_0] = 0$, which
shows that some condition on T or X has to be imposed. This fact leads
to the "martingale" gambling strategy defined by doubling the bet when
losing. If the casinos would not impose a bound on the possible bets,
this gambling strategy would lead to wins. But you have to go there with
enough money. One can see it also like this: if you are A and the casino is
B and b = 1, a = ∞, then $P[X_T = b] = 1$, which means that the casino is
ruined with probability 1.



Theorem 3.2.10 (Wald's identity). Assume T is a stopping time of a
$\mathcal{L}^1$-process Y for which $Y_i$ are $L^\infty$ IID random variables with expectation
$E[Y_i] = m$ and $T \in \mathcal{L}^1$. The process $S_n = \sum_{k=1}^n Y_k$ satisfies

$$E[S_T] = m E[T] .$$

Proof. The process $X_n = S_n - n E[Y_1]$ is a martingale satisfying condition
(iii) in Doob's stopping time theorem. Therefore

$$0 = E[X_0] = E[X_T] = E[S_T - T E[Y_1]] .$$

Now solve for $E[S_T] = E[T] E[Y_1] = m E[T]$. □

In other words, if we play a game where the expected gain in each step is
m and the game is stopped with a random time T which has expectation
t = E[T], then we expect to win mt.

Remark. One could assume Y to be a $L^2$ process and T in $L^2$.

3.3 Doob's convergence theorem

Definition. Given a stochastic process X and two real numbers a < b, we
define the random variable

$$U_n[a,b](\omega) = \max\{k \in N \;|\; \exists\, 0 \le s_1 < t_1 < \cdots < s_k < t_k \le n,
  \; X_{s_i}(\omega) < a, \; X_{t_i}(\omega) > b, \; 1 \le i \le k \} .$$

It is called the number of up-crossings in [a, b]. Denote by $U_\infty[a,b]$ the
limit

$$U_\infty[a,b] = \lim_{n \to \infty} U_n[a,b] .$$

Because $n \mapsto U_n[a,b]$ is monotone, this limit exists in $N \cup \{\infty\}$.
Figure. A random walk crossing two values a < b. An up-crossing starts at a
time s where $X_s < a$ and lasts until the first later time t at which
$X_t > b$. The random variable $U_n[a,b]$ with values in N measures the number
of up-crossings in the time interval [0, n].




Theorem 3.3.1 (Doob's up-crossing inequality). If X is a supermartingale,
then

$$(b-a) E[U_n[a,b]] \le E[(X_n - a)^-] .$$

Proof. Define $C_1 = 1_{\{X_0 < a\}}$ and inductively for $n \ge 2$ the process

$$C_n := 1_{\{C_{n-1} = 1\}} 1_{\{X_{n-1} \le b\}} + 1_{\{C_{n-1} = 0\}} 1_{\{X_{n-1} < a\}} .$$

It is a previsible process. Define the winning process $Y = \int C\, dX$ which
satisfies by definition $Y_0 = 0$. We have the winning inequality

$$Y_n(\omega) \ge (b-a) U_n[a,b](\omega) - (X_n(\omega) - a)^- .$$

Every up-crossing of [a, b] increases the Y-value (the winning) by at least
(b - a), while $(X_n - a)^-$ is essentially the loss during the last interval of
play.

Since C is previsible, bounded and nonnegative, we know that $Y_n$ is also a
supermartingale (see "the system can't be beaten") and we have therefore
$E[Y_n] \le 0$. Taking expectation of the winning inequality leads to the claim.

□ 

Remark. The proof uses the following strategy for putting your stakes C: 
wait until X gets below a. Play then unit stakes until X gets above b and 
stop playing. Wait again until X gets below a, etc. 
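The staking strategy of this remark and the winning inequality from the proof
can be traced along a single path. The sketch below is an illustration under
the assumption that the path is given as a finite list `path[0..n]`; the
function name is hypothetical. It returns the number of completed up-crossings
$U_n[a,b]$ together with the winnings $Y_n = (\int C\, dX)_n$.

```python
import random

def upcrossings_and_winnings(path, a, b):
    """Return (U_n[a,b], Y_n) for the stake process C of the proof."""
    stake = 1 if path[0] < a else 0          # C_1 = 1_{X_0 < a}
    winnings = 0
    count = 0
    for n in range(1, len(path)):
        winnings += stake * (path[n] - path[n - 1])   # Y_n - Y_{n-1} = C_n (X_n - X_{n-1})
        if stake == 1 and path[n] > b:
            count += 1                        # an up-crossing has been completed
        if stake == 1:                        # compute C_{n+1} from C_n and X_n
            stake = 1 if path[n] <= b else 0
        else:
            stake = 1 if path[n] < a else 0
    return count, winnings

a, b = -2, 2
rng = random.Random(0)                        # fixed seed, only for reproducibility
path = [0]
for _ in range(300):
    path.append(path[-1] + rng.choice((-1, 1)))
U, Y = upcrossings_and_winnings(path, a, b)
# winning inequality: Y_n >= (b - a) U_n[a,b] - (X_n - a)^-
assert Y >= (b - a) * U - max(a - path[-1], 0)
```

The inequality holds path by path, which is why taking expectations in the
proof is enough.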

Definition. We say, a stochastic process $X_n$ is bounded in $\mathcal{L}^p$, if there exists
$M \in R$ such that $||X_n||_p \le M$ for all $n \in N$.

Corollary 3.3.2. If X is a supermartingale which is bounded in $\mathcal{L}^1$, then

$$P[U_\infty[a,b] = \infty] = 0 .$$

Proof. By the up-crossing lemma, we have for each $n \in N$

$$(b-a) E[U_n[a,b]] \le |a| + E[|X_n|] \le |a| + \sup_n ||X_n||_1 < \infty .$$

By the monotone convergence theorem, applied to the increasing sequence
$U_n[a,b]$,

$$(b - a) E[U_\infty[a,b]] < \infty ,$$

which gives the claim. □

Remark. If $S_n = \sum_{k=1}^n X_k$ is the one dimensional random walk, then it is
a martingale which is unbounded in $\mathcal{L}^1$. In this case, $E[U_\infty[a,b]] = \infty$.



Theorem 3.3.3 (Doob's convergence theorem). Let $X_n$ be a supermartingale
which is bounded in $\mathcal{L}^1$. Then

$$X_\infty = \lim_{n \to \infty} X_n$$

exists almost everywhere.

Proof.

$$\Lambda = \{\omega \in \Omega \;|\; X_n \text{ has no limit in } [-\infty, \infty] \}
  = \{\omega \in \Omega \;|\; \liminf X_n < \limsup X_n \}
  = \bigcup_{a<b,\; a,b \in Q} \{\omega \in \Omega \;|\; \liminf X_n < a < b < \limsup X_n \}
  = \bigcup_{a<b,\; a,b \in Q} \Lambda_{a,b} .$$

Since $\Lambda_{a,b} \subset \{U_\infty[a,b] = \infty\}$ we have $P[\Lambda_{a,b}] = 0$ and therefore also
$P[\Lambda] = 0$. Therefore $X_\infty = \lim_{n \to \infty} X_n$ exists almost surely. By Fatou's
lemma

$$E[|X_\infty|] = E[\liminf |X_n|] \le \liminf E[|X_n|] \le \sup_n E[|X_n|] < \infty$$

so that $P[|X_\infty| < \infty] = 1$. □

Example. Let X be an integrable random variable on $([0,1), \mathcal{A}, P)$, where P
is the Lebesgue measure. The finite σ-algebra $\mathcal{A}_n$ generated by the intervals

$$A_k = [\frac{k}{2^n}, \frac{k+1}{2^n})$$

defines a filtration and $X_n = E[X|\mathcal{A}_n]$ is a martingale which converges. We
will see below with Lévy's upward theorem (3.4.2) that the limit actually is
the random variable X.






Example. Let $X_k$ be IID random variables in $\mathcal{L}^1$ with mean zero. For
$0 < \lambda < 1$, the branching random walk $S_n = \sum_{k=0}^n \lambda^k X_k$ is a martingale
which is bounded in $\mathcal{L}^1$ because

$$||S_n||_1 \le \frac{1}{1-\lambda} ||X_0||_1 .$$

The martingale converges by Doob's convergence theorem almost surely.
One can also deduce this from Kolmogorov's theorem (2.11.3) if $X_k \in \mathcal{L}^2$.
Doob's convergence theorem (3.3.3) assures convergence for $X_k \in \mathcal{L}^1$.

Remark. Of course, we can replace supermartingale by submartingale or 
martingale in the theorem. 

Example. We look again at Polya's urn scheme, which was defined earlier.
Since the process Y giving the fraction of black balls is a martingale and
bounded $0 \le Y \le 1$, we can apply the convergence theorem: there exists
$Y_\infty$ with $Y_n \to Y_\infty$.
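The martingale property behind this example can be verified exactly for small
n. The sketch below assumes the usual form of the urn: it starts with one black
and one white ball, and in each step a ball is drawn at random and returned
together with a new ball of the same color. The function name is illustrative.

```python
from fractions import Fraction

def polya_distribution(n):
    """Exact law of the number of black balls after n draws (start: 1 black, 1 white)."""
    dist = {1: Fraction(1)}
    for step in range(n):
        total = step + 2                      # balls in the urn before this draw
        new = {}
        for black, prob in dist.items():
            # a black ball is drawn with probability black/total
            new[black + 1] = new.get(black + 1, 0) + prob * Fraction(black, total)
            new[black] = new.get(black, 0) + prob * Fraction(total - black, total)
        dist = new
    return dist

n = 10
law = polya_distribution(n)
mean_fraction = sum(p * Fraction(k, n + 2) for k, p in law.items())
print(mean_fraction)   # the expected fraction of black balls stays at its initial value
```

That the expectation of the fraction does not move is only a consequence of
the martingale property; the convergence theorem says much more, namely that
the fraction itself settles down along almost every sequence of draws.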



Corollary 3.3.4. If X is a non-negative supermartingale, then
$X_\infty = \lim_{n \to \infty} X_n$ exists almost everywhere and is finite.

Proof. Since the supermartingale property gives $E[|X_n|] = E[X_n] \le E[X_0]$,
the process $X_n$ is bounded in $\mathcal{L}^1$. Apply Doob's convergence theorem. □

Remark. This corollary is also true for non-positive submartingales or mar- 
tingales, which are either nonnegative or non-positive. 

Example. For the branching process, we had IID random variables $Z_{ni}$
with positive finite mean m and defined $Y_0 = 1$, $Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk}$. We
saw that the process $X_n = Y_n/m^n$ is non-negative and a martingale.
According to the above corollary, the limit $X_\infty$ exists almost everywhere. It
is an interesting problem to find the distribution of $X_\infty$: Assume $Z_{ni}$ have
the generating function $f(\theta) = E[\theta^{Z_{ni}}]$.

(i) $Y_n$ has the generating function $f^n(\theta) = f(f^{n-1}(\theta))$.

We prove this by induction. For n = 1 this is trivial. Using the independence
of $Z_{nk}$ we have

$$E[\theta^{Y_{n+1}} | Y_n = k] = f(\theta)^k$$

and so

$$E[\theta^{Y_{n+1}} | Y_n] = f(\theta)^{Y_n} .$$

By the tower property, this leads to

$$E[\theta^{Y_{n+1}}] = E[f(\theta)^{Y_n}] .$$

Write $\alpha = f(\theta)$ and use induction to simplify the right hand side to

$$E[f(\theta)^{Y_n}] = E[\alpha^{Y_n}] = f^n(\alpha) = f^n(f(\theta)) = f^{n+1}(\theta) .$$

(ii) In order to find the distribution of $X_\infty$ we calculate instead the
characteristic function

$$L(\lambda) = L(X_\infty)(\lambda) = E[\exp(i \lambda X_\infty)] .$$

Since $X_n \to X_\infty$ almost everywhere, we have $L(X_n)(\lambda) \to L(X_\infty)(\lambda)$.
Since $X_n = Y_n/m^n$ and $E[\theta^{Y_n}] = f^n(\theta)$, we have

$$L(X_n)(\lambda) = f^n(e^{i\lambda/m^n})$$

so that L satisfies the functional equation

$$L(\lambda m) = f(L(\lambda)) .$$



Theorem 3.3.5 (Limit distribution of the branching process). For the
branching process defined by IID random variables $Z_{ni}$ having the
generating function f, the Fourier transform $L(\lambda) = E[e^{i\lambda X_\infty}]$ of the
distribution of the limit martingale $X_\infty$ can be computed by solving the
functional equation

$$L(\lambda \cdot m) = f(L(\lambda)) .$$

Remark. If f has no analytic extension to the complex plane, we have to
replace the Fourier transform with the Laplace transform

$$L(\lambda) = E[e^{-\lambda X_\infty}] .$$

Remark. Related to Doob's convergence theorem for supermartingales is
Kingman's subadditive ergodic theorem, which generalizes Birkhoff's ergodic
theorem and which we state without proof. Neither of the two theorems
is however a corollary of the other.

Definition. A sequence of random variables $X_n$ is called subadditive with
respect to a measure preserving transformation T, if
$X_{m+n} \le X_m + X_n(T^m)$ almost everywhere.

Theorem 3.3.6 (The subadditive ergodic theorem of Kingman). Given a
sequence of random variables $X_n : X \to R \cup \{-\infty\}$ with
$X_n^+ := \max(0, X_n) \in L^1(X)$ which is subadditive with respect to a measure
preserving transformation T. Then there exists a T-invariant integrable
measurable function $X : \Omega \to R \cup \{-\infty\}$ such that $\frac{1}{n} X_n(x) \to X(x)$ for
almost all $x \in X$. Furthermore $\frac{1}{n} E[X_n] \to E[X]$.
If the condition of boundedness of the process in Doob's convergence theorem
is strengthened a bit by assuming that $X_n$ is uniformly integrable,
then one can reverse in some sense the convergence theorem:

Theorem 3.3.7 (Doob's convergence theorem for uniformly integrable
supermartingales). A supermartingale $X_n$ is uniformly integrable if and only
if there exists X such that $X_n \to X$ in $\mathcal{L}^1$.

Proof. If $X_n$ is uniformly integrable, then $X_n$ is bounded in $\mathcal{L}^1$ and Doob's
convergence theorem gives $X_n \to X$ almost everywhere. But a uniformly
integrable family $X_n$ which converges almost everywhere converges in $\mathcal{L}^1$.
On the other hand, a sequence $X_n \in \mathcal{L}^1$ converging to $X \in \mathcal{L}^1$ is uniformly
integrable. □



Theorem 3.3.8 (Characterization of uniformly integrable martingales). An
$\mathcal{A}_n$-adapted process $X_n$ is a uniformly integrable martingale if and only if
$X_n \to X$ in $\mathcal{L}^1$ and $X_n = E[X|\mathcal{A}_n]$.

Proof. By Doob's convergence theorem (3.3.7), we know the "only if"-part.
To prove the "if" part, assume $X_n = E[X|\mathcal{A}_n] \to X$. We already know that
$X_n = E[X|\mathcal{A}_n]$ is a martingale. What we have to show is that it is uniformly
integrable.

Given ε > 0. Choose δ > 0 such that for all $A \in \mathcal{A}$, the condition P[A] < δ
implies E[|X|; A] < ε. Choose further $K \in R$ such that $K^{-1} \cdot E[|X|] < \delta$.
By Jensen's inequality

$$E[|X_n|] = E[|E[X|\mathcal{A}_n]|] \le E[E[|X| \,|\, \mathcal{A}_n]] \le E[|X|] .$$

Therefore

$$K \cdot P[|X_n| > K] \le E[|X_n|] \le E[|X|] \le \delta K$$

so that $P[|X_n| > K] < \delta$. By definition of conditional expectation,
$|X_n| \le E[|X| \,|\, \mathcal{A}_n]$ and $\{|X_n| > K\} \in \mathcal{A}_n$, so that

$$E[|X_n|; |X_n| > K] \le E[|X|; |X_n| > K] < \epsilon .$$

□

Remark. As a summary we can say that a supermartingale $X_n$ which is either
bounded in $\mathcal{L}^1$ or nonnegative or uniformly integrable converges almost
everywhere.
Exercise. Let S and T be stopping times satisfying $S \le T$.

a) Show that the process

$$C_n(\omega) = 1_{\{S(\omega) < n \le T(\omega)\}}$$

is previsible.

b) Show that for every supermartingale X and bounded stopping times
$S \le T$ the inequality

$$E[X_T] \le E[X_S]$$

holds.

Exercise. In Polya's urn process, let $Y_n$ be the number of black balls after
n steps. Let $X_n = Y_n/(n + 2)$ be the fraction of black balls. We have seen
that X is a martingale.

a) Prove that $P[Y_n = k] = 1/(n + 1)$ for every $1 \le k \le n + 1$.

b) Compute the distribution of the limit $X_\infty$.

Exercise. a) Which polynomials f can you realize as generating functions
of a probability distribution? Denote this class of polynomials with $\mathcal{P}$.

b) Design a martingale $X_n$, where the iteration of polynomials $P \in \mathcal{P}$ plays
a role.

c) Use one of the consequences of Doob's convergence theorem to show
that the dynamics of every polynomial $P \in \mathcal{P}$ on the positive axis can be
conjugated to a linear map $T : z \mapsto mz$: there exists a map L such that

$$L \circ T(z) = P \circ L(z)$$

for every $z \in R^+$.



Example. The branching process $Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk}$ defined by random
variables $Z_{nk}$ having generating function f and mean m defines a
martingale $X_n = Y_n/m^n$. We have seen that the Laplace transform
$L(\lambda) = E[e^{-\lambda X_\infty}]$ of the limit $X_\infty$ satisfies the functional equation

$$L(m\lambda) = f(L(\lambda)) .$$

We assume that the IID random variables $Z_{nk}$ have the geometric
distribution $P[Z = k] = p(1 - p)^k = p q^k$ with parameter 0 < p < 1. The
probability generating function of this distribution is

$$f(\theta) = E[\theta^Z] = \sum_{k=0}^\infty p q^k \theta^k = \frac{p}{1 - q\theta} .$$

As we have seen in proposition (2.12.5),

$$E[Z] = \sum_{k=1}^\infty p q^k k = \frac{q}{p} .$$

The function $f^n(\theta)$ can be computed as

$$f^n(\theta) = \frac{p m^n (1-\theta) + q\theta - p}{q m^n (1-\theta) + q\theta - p} .$$

This is because f is a Möbius transformation and iterating f corresponds
to look at the power

$$A^n = \begin{pmatrix} 0 & p \\ -q & 1 \end{pmatrix}^n .$$

This power can be computed by diagonalizing A:

$$A^n = (q-p)^{-1} \begin{pmatrix} 1 & p \\ 1 & q \end{pmatrix}
  \begin{pmatrix} p^n & 0 \\ 0 & q^n \end{pmatrix}
  \begin{pmatrix} q & -p \\ -1 & 1 \end{pmatrix} .$$

We get therefore

$$L(\lambda) = E[e^{-\lambda X_\infty}] = \lim_{n \to \infty} E[e^{-\lambda Y_n/m^n}]
  = \lim_{n \to \infty} f^n(e^{-\lambda/m^n}) = \frac{p\lambda + q - p}{q\lambda + q - p} .$$

If $m \le 1$, then the law of $X_\infty$ is a Dirac mass at 0. This means that the
process dies out. We see in this case directly that $\lim_{n \to \infty} f^n(\theta) = 1$. In
the case m > 1, the law of $X_\infty$ has a point mass at 0 of weight p/q = 1/m
and an absolutely continuous part $(1/m - 1)^2 e^{(1/m - 1)x} dx$. This can be
seen by performing a "look up" in a table of Laplace transforms

$$L(\lambda) = \frac{p}{q} e^{-\lambda \cdot 0} + \int_0^\infty (1 - p/q)^2 e^{(p/q - 1)x} \cdot e^{-\lambda x} \, dx .$$
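The closed formula for the iterates $f^n$ of the geometric generating function
$f(\theta) = p/(1 - q\theta)$ can be compared with a direct iteration. The
following sketch uses exact rational arithmetic; the function names are
illustrative, and the check assumes $p \ne 1/2$ so that $q - p \ne 0$ and the
closed formula is well defined.

```python
from fractions import Fraction

def f(theta, p):
    """Generating function of the geometric law P[Z = k] = p q^k, q = 1 - p."""
    return p / (1 - (1 - p) * theta)

def f_iterated(theta, p, n):
    """n-fold composition of f, computed by plain iteration."""
    for _ in range(n):
        theta = f(theta, p)
    return theta

def f_closed(theta, p, n):
    """Closed formula for the n-th iterate, with m = q/p."""
    q = 1 - p
    mn = (q / p) ** n
    numerator = p * mn * (1 - theta) + q * theta - p
    denominator = q * mn * (1 - theta) + q * theta - p
    return numerator / denominator

p, theta = Fraction(1, 3), Fraction(1, 4)
for n in range(1, 8):
    assert f_iterated(theta, p, n) == f_closed(theta, p, n)
```

With rational inputs both computations agree exactly, which is a useful sanity
check on the diagonalization of the matrix A.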



Definition. Define $p_n = P[Y_n = 0]$, the probability that the process dies
out until time n. Since $p_n = f^n(0)$ we have $p_{n+1} = f(p_n)$. The limit p of
the sequence $p_n$, which satisfies f(p) = p, is called the extinction
probability.

Proposition 3.3.9. For a branching process with E[Z] > 1, the extinction
probability is the unique solution of f(x) = x in (0, 1). For $E[Z] \le 1$, the
extinction probability is 1.

Proof. The generating function $f(\theta) = E[\theta^Z] = \sum_{n=0}^\infty P[Z = n] \theta^n$
is analytic in [0, 1]. It is nondecreasing and satisfies f(1) = 1.
If we assume that P[Z = 0] > 0, then f(0) > 0 and there exists a unique
solution of f(x) = x satisfying f'(x) < 1. The orbit $f^n(u)$ converges to
this fixed point for every $u \in (0, 1)$ and this fixed point is the extinction
probability of the process. The value of f'(1) = E[Z] decides whether there
exists an attracting fixed point in the interval (0, 1) or not. □
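The recursion $p_{n+1} = f(p_n)$ is easy to run numerically. As an illustration,
the sketch below uses the geometric offspring law $P[Z = k] = p q^k$ of the
previous example, for which $f(\theta) = p/(1 - q\theta)$ and the fixed points of f
are 1 and p/q. The function name and the iteration count are assumptions of
this sketch.

```python
def extinction_probability(p, iterations=500):
    """Iterate p_{n+1} = f(p_n) from p_0 = 0 for f(x) = p / (1 - q x)."""
    q = 1.0 - p
    x = 0.0                      # p_0 = P[Y_0 = 0] = 0, since Y_0 = 1
    for _ in range(iterations):
        x = p / (1.0 - q * x)
    return x

# supercritical case: m = q/p = 2 > 1, the limit is the fixed point p/q in (0, 1)
print(extinction_probability(1 / 3))
# subcritical case: m = q/p = 2/3 < 1, the limit is 1
print(extinction_probability(0.6))
```

In the critical case p = 1/2 the fixed point 1 is no longer attracting at a
geometric rate, so a fixed number of iterations gives only a rough value there.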
3.4 Lévy's upward and downward theorems

Lemma 3.4.1. Given $X \in \mathcal{L}^1$. Then the class of random variables

$$\{Y = E[X|\mathcal{B}] \;|\; \mathcal{B} \subset \mathcal{A}, \; \mathcal{B} \text{ is a σ-algebra} \}$$

is uniformly integrable.

Proof. Given ε > 0. Choose δ > 0 such that for all $A \in \mathcal{A}$, P[A] < δ
implies E[|X|; A] < ε. Choose further $K \in R$ such that $K^{-1} \cdot E[|X|] < \delta$.
By Jensen's inequality, $Y = E[X|\mathcal{B}]$ satisfies

$$E[|Y|] = E[|E[X|\mathcal{B}]|] \le E[E[|X| \,|\, \mathcal{B}]] \le E[|X|] .$$

Therefore

$$K \cdot P[|Y| > K] \le E[|Y|] \le E[|X|] \le \delta K$$

so that P[|Y| > K] < δ. By definition of conditional expectation,
$|Y| \le E[|X| \,|\, \mathcal{B}]$ and $\{|Y| > K\} \in \mathcal{B}$, so that

$$E[|Y|; |Y| > K] \le E[|X|; |Y| > K] < \epsilon .$$

□

Definition. Denote by $\mathcal{A}_\infty$ the σ-algebra generated by $\bigcup_n \mathcal{A}_n$.



Theorem 3.4.2 (Lévy's upward theorem). Given $X \in \mathcal{L}^1$. Then
$X_n = E[X|\mathcal{A}_n]$ is a uniformly integrable martingale and $X_n$ converges in
$\mathcal{L}^1$ to $X_\infty = E[X|\mathcal{A}_\infty]$.

Proof. The process X is a martingale. The sequence $X_n$ is uniformly
integrable by the above lemma. Therefore $X_\infty$ exists almost everywhere by
Doob's convergence theorem for uniformly integrable martingales, and since
the family $X_n$ is uniformly integrable, the convergence is in $\mathcal{L}^1$. We have
to show that $X_\infty = Y := E[X|\mathcal{A}_\infty]$.

By proving the claim for the positive and negative part, we can assume
that $X \ge 0$ (and so $Y \ge 0$). Consider the two measures

$$Q_1(A) = E[X; A] , \qquad Q_2(A) = E[X_\infty; A] .$$

Since $E[X_\infty|\mathcal{A}_n] = E[X|\mathcal{A}_n]$, we know that $Q_1$ and $Q_2$ agree on the
π-system $\bigcup_n \mathcal{A}_n$. They agree therefore everywhere on $\mathcal{A}_\infty$. Define the event
$A = \{E[X|\mathcal{A}_\infty] > X_\infty\} \in \mathcal{A}_\infty$. Since
$Q_1(A) - Q_2(A) = E[E[X|\mathcal{A}_\infty] - X_\infty; A] = 0$ we have $E[X|\mathcal{A}_\infty] \le X_\infty$ almost
everywhere. Similarly also $X_\infty \le E[X|\mathcal{A}_\infty]$ almost everywhere. □

As an application, we see a martingale proof of Kolmogorov's 0-1 law:

Corollary 3.4.3. For any sequence $\mathcal{A}_n$ of independent σ-algebras, the tail
σ-algebra $\mathcal{T} = \bigcap_n \mathcal{B}_n$ with $\mathcal{B}_n$ the σ-algebra generated by $\bigcup_{m>n} \mathcal{A}_m$ is
trivial.

Proof. Given $A \in \mathcal{T}$, define $X = 1_A \in \mathcal{L}^\infty(\mathcal{T})$ and the σ-algebras
$\mathcal{C}_n = \sigma(\mathcal{A}_1, \ldots, \mathcal{A}_n)$. By Lévy's upward theorem (3.4.2),

$$X = E[X|\mathcal{C}_\infty] = \lim_{n \to \infty} E[X|\mathcal{C}_n] .$$

But since $\mathcal{C}_n$ is independent of $\mathcal{B}_n$, and hence of $\mathcal{T}$, we have by (8)
in Theorem (3.1.4)

$$P[A] = E[X] = E[X|\mathcal{C}_n] \to X .$$

Because X is 0-1 valued and X = P[A], it must be constant and so
P[A] = 1 or P[A] = 0. □

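Definition. A sequence $\mathcal{A}_{-n}$ of σ-algebras satisfying

$$\cdots \subset \mathcal{A}_{-n} \subset \mathcal{A}_{-(n-1)} \subset \cdots \subset \mathcal{A}_{-1}$$

is called a downward filtration. Define $\mathcal{A}_{-\infty} = \bigcap_n \mathcal{A}_{-n}$.

Theorem 3.4.4 (Lévy's downward theorem). Given a downward filtration
$\mathcal{A}_{-n}$ and $X \in \mathcal{L}^1$. Define $X_{-n} = E[X|\mathcal{A}_{-n}]$. Then
$X_{-\infty} = \lim_{n \to \infty} X_{-n}$ converges in $\mathcal{L}^1$ and $X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$.

Proof. Apply Doob's up-crossing lemma to the uniformly integrable
martingale

$$X_k , \quad -n \le k \le -1 :$$

for all a < b, the expected number of up-crossings is bounded

$$E[U_k[a,b]] \le (|a| + ||X||_1)/(b-a) .$$

This implies in the same way as in the proof of Doob's convergence theorem
that $\lim_{n \to \infty} X_{-n}$ converges almost everywhere. By the uniform
integrability, the convergence is also in $\mathcal{L}^1$.

We show now that $X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$: given $A \in \mathcal{A}_{-\infty}$, we have
$E[X; A] = E[X_{-n}; A] \to E[X_{-\infty}; A]$. The same argument as before shows that
$X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$. □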
Let's also look at a martingale proof of the strong law of large numbers.

Corollary 3.4.5. Given $X_n \in \mathcal{L}^1$ which are IID and have mean m. Then
$S_n/n \to m$ in $\mathcal{L}^1$.

Proof. Define the downward filtration $\mathcal{A}_{-n} = \sigma(S_n, S_{n+1}, \ldots)$.
By symmetry, $E[X_1|\mathcal{A}_{-n}] = E[X_i|\mathcal{A}_{-n}] = E[X_i|S_n, S_{n+1}, \ldots]$ for
$1 \le i \le n$. Summing over i gives $n E[X_1|\mathcal{A}_{-n}] = E[S_n|\mathcal{A}_{-n}] = S_n$ and so
$E[X_1|\mathcal{A}_{-n}] = S_n/n$. We can apply Lévy's downward theorem to see that
$S_n/n$ converges in $\mathcal{L}^1$. Since the limit X is measurable with respect to the
tail σ-algebra $\mathcal{T}$, it is by Kolmogorov's 0-1 law a constant c and
$c = E[X] = \lim_{n \to \infty} E[S_n/n] = m$. □



3.5 Doob's decomposition of a stochastic process 

Definition. A process $X_n$ is increasing, if $P[X_n \le X_{n+1}] = 1$.

Theorem 3.5.1 (Doob's decomposition). Let $X_n$ be an $\mathcal{A}_n$-adapted
$\mathcal{L}^1$-process. Then

$$X = X_0 + N + A$$

where N is a martingale null at 0 and A is a previsible process null at 0.
This decomposition is unique in $\mathcal{L}^1$. X is a submartingale if and only if A
is increasing.

Proof. If X has a Doob decomposition $X = X_0 + N + A$, then

$$E[X_n - X_{n-1}|\mathcal{A}_{n-1}] = E[N_n - N_{n-1}|\mathcal{A}_{n-1}] + E[A_n - A_{n-1}|\mathcal{A}_{n-1}] = A_n - A_{n-1}$$

which means that

$$A_n = \sum_{k=1}^n E[X_k - X_{k-1}|\mathcal{A}_{k-1}] .$$

If we define A like this, we get the required decomposition and the
submartingale characterization is also obvious. □
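The formula for A in this proof can be evaluated explicitly when X is a
function of the simple symmetric random walk. The sketch below assumes
$X_k = g(k, S_k)$ with $S_k$ a walk with steps +1 or -1 of probability 1/2 each,
so that the conditional expectation is an average over the two possible next
positions. The function name and the signature of g are assumptions of this
illustration.

```python
def doob_decomposition(g, steps):
    """Return the lists (N, A) with X_k = X_0 + N_k + A_k along one path.

    g(k, s) is the value of X at time k when the walk is at s;
    steps is the list of observed increments, each +1 or -1."""
    s = 0
    A = [0]
    N = [0]
    for k, y in enumerate(steps, start=1):
        # E[X_k - X_{k-1} | A_{k-1}]: average over the two possible moves
        drift = (g(k, s + 1) + g(k, s - 1)) / 2 - g(k - 1, s)
        s_new = s + y
        A.append(A[-1] + drift)                                   # previsible part
        N.append(N[-1] + (g(k, s_new) - g(k - 1, s)) - drift)     # martingale part
        s = s_new
    return N, A

# X_k = S_k^2 is a submartingale; its previsible part is A_k = k
N, A = doob_decomposition(lambda k, s: s * s, [1, -1, -1, 1, 1, 1])
print(A)
print(N)
```

For $X = S^2$ the increasing process A is deterministic, in line with the
statement that X is a submartingale if and only if A is increasing.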

Remark. The corresponding result for continuous time processes is deeper 
and called Doob-Meyer decomposition theorem. See theorem (4.17.2). 
Lemma 3.5.2. Given $s, t, u, v \in N$ with $s \le t \le u \le v$. If $X_n$ is a
$\mathcal{L}^2$-martingale, then

$$E[(X_t - X_s)(X_v - X_u)] = 0$$

and

$$E[X_n^2] = E[X_0^2] + \sum_{k=1}^n E[(X_k - X_{k-1})^2] .$$

Proof. Because $E[X_v - X_u|\mathcal{A}_u] = X_u - X_u = 0$, we know that $X_v - X_u$
is orthogonal to $\mathcal{L}^2(\mathcal{A}_u)$. The first claim follows since $X_t - X_s \in \mathcal{L}^2(\mathcal{A}_u)$.
The formula

$$X_n = X_0 + \sum_{k=1}^n (X_k - X_{k-1})$$

expresses $X_n$ as a sum of orthogonal terms and Pythagoras theorem gives
the second claim. □



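Corollary 3.5.3. A $\mathcal{L}^2$-martingale X is bounded in $\mathcal{L}^2$ if and only if
$\sum_{k=1}^\infty E[(X_k - X_{k-1})^2] < \infty$.

Proof.

$$E[X_n^2] = E[X_0^2] + \sum_{k=1}^n E[(X_k - X_{k-1})^2]
  \le E[X_0^2] + \sum_{k=1}^\infty E[(X_k - X_{k-1})^2] < \infty .$$

If on the other hand, $X_n$ is bounded in $\mathcal{L}^2$, then $||X_n||_2 \le K < \infty$ and
$\sum_{k=1}^n E[(X_k - X_{k-1})^2] = E[X_n^2] - E[X_0^2] \le K^2$ for all n. □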



Theorem 3.5.4 (Doob's convergence theorem for $\mathcal{L}^2$-martingales). Let $X_n$
be a $\mathcal{L}^2$-martingale which is bounded in $\mathcal{L}^2$, then there exists $X \in \mathcal{L}^2$ such
that $X_n \to X$ in $\mathcal{L}^2$.

Proof. If X is bounded in $\mathcal{L}^2$, then, by monotonicity of the norm
$||X||_1 \le ||X||_2$, it is bounded in $\mathcal{L}^1$ so that by Doob's convergence theorem,
$X_n \to X$ almost everywhere for some X. By Pythagoras, Fatou's lemma and
the previous corollary (3.5.3), we have

$$E[(X - X_n)^2] \le \sum_{k \ge n+1} E[(X_k - X_{k-1})^2] \to 0$$

so that $X_n \to X$ in $\mathcal{L}^2$. □



Definition. Let $X_n$ be a martingale in $\mathcal{L}^2$ which is null at 0. The conditional
Jensen's inequality (3.1.4) shows that $X^2$ is a submartingale. Doob's
decomposition theorem allows to write $X^2 = N + A$, where N is a martingale
and A is a previsible increasing process. Define $A_\infty = \lim_{n \to \infty} A_n$
pointwise, where the limit is allowed to take the value ∞ also. One writes also
$\langle X \rangle$ for A so that

$$X^2 = N + \langle X \rangle .$$

Lemma 3.5.5. Assume X is a $\mathcal{L}^2$-martingale. X is bounded in $\mathcal{L}^2$ if and
only if $E[\langle X \rangle_\infty] < \infty$.

Proof. From $X^2 = N + A$, we get $E[X_n^2] = E[A_n]$ since for a martingale N,
the equality $E[N_n] = E[N_0]$ holds and N is null at 0. Therefore, X is bounded
in $\mathcal{L}^2$ if and only if $E[A_\infty] < \infty$ since $A_n$ is increasing. □

We can now relate the convergence of the process $X_n$ to the finiteness of
$A_\infty = \langle X \rangle_\infty$:

Proposition 3.5.6. Assume $||X_n - X_{n-1}||_\infty \le K$ for all n. Then
$\lim_{n \to \infty} X_n(\omega)$ converges if and only if $A_\infty(\omega) < \infty$.



Proof. a) We first show that $A_\infty(\omega) < \infty$ implies that $\lim_{n \to \infty} X_n(\omega)$
converges. Because the process A is previsible, we can define for every k
a stopping time $S(k) = \inf\{n \in N \;|\; A_{n+1} > k\}$. The assumption shows
that for almost all ω there is a k such that S(k) = ∞. The stopped process
$A^{S(k)}$ is also previsible because for $B \in \mathcal{B}_R$ and $n \in N$,

$$\{A_{n \wedge S(k)} \in B\} = C_1 \cup C_2$$

with

$$C_1 = \bigcup_{i=0}^{n-1} \{S(k) = i; A_i \in B\} \in \mathcal{A}_{n-1} ,$$

$$C_2 = \{A_n \in B\} \cap \{S(k) \le n - 1\}^c \in \mathcal{A}_{n-1} .$$

Now, since

$$(X^{S(k)})^2 - A^{S(k)} = (X^2 - A)^{S(k)}$$

is a martingale, we see that $\langle X^{S(k)} \rangle = A^{S(k)}$. The latter process $A^{S(k)}$
is bounded by k so that by the above lemma $X^{S(k)}$ is bounded in $\mathcal{L}^2$
and $\lim_n X^{S(k)}_n(\omega) = \lim_n X_{n \wedge S(k)}(\omega)$ exists almost everywhere. But since
for almost all ω with $A_\infty(\omega) < \infty$ there is a k with S(k) = ∞, we also know
that $\lim_n X_n(\omega)$ exists for almost all such ω.

b) Now we prove that the existence of $\lim_{n \to \infty} X_n(\omega)$ implies that
$A_\infty(\omega) < \infty$ almost everywhere. Suppose the claim is wrong and that

$$P[A_\infty = \infty, \sup_n |X_n| < \infty] > 0 .$$

Then, for some c,

$$P[T(c) = \infty; A_\infty = \infty] > 0 ,$$

where T(c) is the stopping time

$$T(c) = \inf\{n \;|\; |X_n| > c\} .$$

Now

$$E[X^2_{T(c) \wedge n} - A_{T(c) \wedge n}] = 0$$

and $X^{T(c)}$ is bounded by c + K. Thus

$$E[A_{T(c) \wedge n}] \le (c + K)^2$$

for all n. This is a contradiction to $P[A_\infty = \infty, \sup_n |X_n| < \infty] > 0$. □

Example. Let $Y_k$ be a sequence of independent random variables of zero mean
and standard deviation $\sigma_k$. Assume the $Y_k$ are uniformly bounded:
$||Y_k||_\infty \le K$. Define the process $X_n = \sum_{k=1}^n Y_k$. Write
$X_n^2 = N_n + A_n$ with $A_n = \sum_{k=1}^n E[Y_k^2] = \sum_{k=1}^n \sigma_k^2$ and
$N_n = X_n^2 - A_n$. In this case $A_n$ is a numerical sequence and not
a random variable. The last proposition implies that $X_n$ converges almost
everywhere if and only if $\sum_{k=1}^\infty \sigma_k^2$ converges. Of course we know this also
from Pythagoras which assures that $Var[X_n] = \sum_{k=1}^n Var[Y_k] = \sum_{k=1}^n \sigma_k^2$
and implies that $X_n$ converges in $\mathcal{L}^2$.



Theorem 3.5.7 (A strong law for martingales). Let X be a $\mathcal{L}^2$-martingale
zero at 0 and let $A = \langle X \rangle$. Then

$$\frac{X_n}{A_n} \to 0$$

almost surely on $\{A_\infty = \infty\}$.

Proof. (i) Cesaro's lemma: Given $0 = b_0 < b_1 \le \ldots$, $b_n \le b_{n+1} \to \infty$ and a
sequence $v_n \in R$ which converges $v_n \to v_\infty$, then
$\frac{1}{b_n} \sum_{k=1}^n (b_k - b_{k-1}) v_k \to v_\infty$.

Proof. Let ε > 0. Choose m such that $v_k > v_\infty - \epsilon$ if $k \ge m$. Then

$$\liminf_{n \to \infty} \frac{1}{b_n} \sum_{k=1}^n (b_k - b_{k-1}) v_k
  \ge \liminf_{n \to \infty} \frac{1}{b_n} \sum_{k=1}^m (b_k - b_{k-1}) v_k
  + \liminf_{n \to \infty} \frac{b_n - b_m}{b_n} (v_\infty - \epsilon)
  \ge 0 + v_\infty - \epsilon .$$

Since this is true for every ε > 0, we have $\liminf \ge v_\infty$. By a similar
argument $\limsup \le v_\infty$. □

(ii) Kronecker's lemma: Given $0 = b_0 < b_1 \le \ldots$, $b_n \le b_{n+1} \to \infty$ and a
sequence $x_n$ of real numbers. Define $s_n = x_1 + \cdots + x_n$. Then the convergence
of $u_n = \sum_{k=1}^n x_k/b_k$ implies that $s_n/b_n \to 0$.

Proof. We have $u_n - u_{n-1} = x_n/b_n$ and

$$s_n = \sum_{k=1}^n b_k (u_k - u_{k-1}) = b_n u_n - \sum_{k=1}^n (b_k - b_{k-1}) u_{k-1} .$$

Cesaro's lemma (i) implies that $s_n/b_n$ converges to $u_\infty - u_\infty = 0$. □

(iii) Proof of the claim: since A is increasing and null at 0, we have
$A_n \ge 0$ and $1/(1 + A_n)$ is bounded. Since A is previsible, also $1/(1 + A_n)$
is previsible, and we can define the martingale

$$W_n = (\int (1 + A)^{-1} dX)_n = \sum_{k=1}^n \frac{X_k - X_{k-1}}{1 + A_k} .$$

Moreover, since $(1 + A_n)$ is $\mathcal{A}_{n-1}$-measurable, we have

$$E[(W_n - W_{n-1})^2 | \mathcal{A}_{n-1}] = (1 + A_n)^{-2} (A_n - A_{n-1})
  \le (1 + A_{n-1})^{-1} - (1 + A_n)^{-1}$$

almost surely. This implies that $\langle W \rangle_\infty \le 1$ so that $\lim_{n \to \infty} W_n$ exists
almost surely. Kronecker's lemma (ii) applied pointwise implies that on
$\{A_\infty = \infty\}$

$$\lim_{n \to \infty} X_n/(1 + A_n) = \lim_{n \to \infty} X_n/A_n = 0 .$$

□ 
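The summation by parts identity used for Kronecker's lemma (ii) in the proof
above can be checked in exact arithmetic. The sketch below takes the
illustrative choice $x_k = (-1)^k$ and $b_k = k$, for which $u_n$ is a partial sum of
the alternating harmonic series; the function names are assumptions of this
example.

```python
from fractions import Fraction

def partial_sums(x, b):
    """Return (u, s) with u[n] = sum_{k<=n} x_k / b_k and s[n] = sum_{k<=n} x_k.

    x and b list the values x_1..x_n and b_1..b_n; index 0 holds the empty sum."""
    u = [Fraction(0)]
    s = [Fraction(0)]
    for xk, bk in zip(x, b):
        u.append(u[-1] + Fraction(xk) / bk)
        s.append(s[-1] + xk)
    return u, s

def summation_by_parts(u, b):
    """Right hand side b_n u_n - sum_{k=1}^n (b_k - b_{k-1}) u_{k-1}, with b_0 = 0."""
    n = len(b)
    bb = [0] + list(b)
    return bb[n] * u[n] - sum((bb[k] - bb[k - 1]) * u[k - 1] for k in range(1, n + 1))

n = 201
x = [(-1) ** k for k in range(1, n + 1)]
b = list(range(1, n + 1))
u, s = partial_sums(x, b)
assert s[n] == summation_by_parts(u, b)
print(float(u[n]), float(s[n] / b[-1]))   # u_n settles down while s_n / b_n is small
```

In this example $u_n$ converges while $s_n$ stays bounded, so $s_n/b_n \to 0$ is
visible directly; the lemma says the same conclusion holds whenever $u_n$
converges.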

3.6 Doob's submartingale inequality 

We still follow closely [113]: 



Theorem 3.6.1 (Doob's submartingale inequality). For any non-negative
submartingale X and every ε > 0

$$\epsilon \cdot P[\sup_{0 \le k \le n} X_k \ge \epsilon] \le E[X_n; \{\sup_{0 \le k \le n} X_k \ge \epsilon\}] \le E[X_n] .$$

Proof. The set $A = \{\sup_{0 \le k \le n} X_k \ge \epsilon\}$ is a disjoint union of the sets

$$A_0 = \{X_0 \ge \epsilon\} \in \mathcal{A}_0 ,$$

$$A_k = \{X_k \ge \epsilon\} \cap (\bigcap_{i=0}^{k-1} A_i^c) \in \mathcal{A}_k .$$

Since X is a submartingale, and $X_k \ge \epsilon$ on $A_k$, we have for $k \le n$

$$E[X_n; A_k] \ge E[X_k; A_k] \ge \epsilon P[A_k] .$$

Summing up from k = 0 to n gives the result. □

We have seen the following result already as part of theorem (2.11.1). Here
it appears as a special case of the submartingale inequality:

Theorem 3.6.2 (Kolmogorov's inequality). Given $X_n \in \mathcal{L}^2$ IID with
$E[X_i] = 0$ and $S_n = \sum_{k=1}^n X_k$. Then for ε > 0,

$$P[\sup_{1 \le k \le n} |S_k| \ge \epsilon] \le \frac{Var[S_n]}{\epsilon^2} .$$

Proof. $S_n$ is a martingale with respect to $\mathcal{A}_n = \sigma(X_1, X_2, \ldots, X_n)$. Because
$u(x) = x^2$ is convex, $S_n^2$ is a submartingale. Now apply the submartingale
inequality (3.6.1). □
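For the simple symmetric random walk both sides of Kolmogorov's inequality
can be computed exactly by enumerating all paths. The following sketch assumes
steps +1 or -1 with probability 1/2 each, so that $Var[S_n] = n$; the function
name is illustrative and the enumeration is only feasible for small n.

```python
from fractions import Fraction
from itertools import product

def max_abs_tail(n, eps):
    """Exact P[max_{1<=k<=n} |S_k| >= eps] for the simple symmetric random walk."""
    hits = 0
    for steps in product((-1, 1), repeat=n):   # all 2^n equally likely paths
        s = 0
        running_max = 0
        for y in steps:
            s += y
            running_max = max(running_max, abs(s))
        if running_max >= eps:
            hits += 1
    return Fraction(hits, 2 ** n)

n, eps = 10, 4
lhs = max_abs_tail(n, eps)
rhs = Fraction(n, eps ** 2)                     # Var[S_n] / eps^2
print(lhs, rhs)
assert lhs <= rhs
```

The bound is far from sharp here, which is typical: the inequality only uses
the second moment of $S_n$.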

Here is another proof of the law of iterated logarithm for independent
N(0, 1) random variables.

Theorem 3.6.3 (Special case of law of iterated logarithm). Given $X_n$ IID
with standard normal distribution N(0, 1). Then
$\limsup_{n \to \infty} S_n/\Lambda(n) = 1$.

Proof. We will use for

$$1 - \Phi(x) = \int_x^\infty \phi(y) \, dy = \int_x^\infty (2\pi)^{-1/2} \exp(-y^2/2) \, dy$$

the elementary estimates

$$(x + x^{-1})^{-1} \phi(x) \le 1 - \Phi(x) \le x^{-1} \phi(x) .$$
(i) $S_n$ is a martingale relative to $\mathcal{A}_n = \sigma(X_1, \ldots, X_n)$. The function
$x \mapsto e^{\theta x}$ is convex on R so that $e^{\theta S_n}$ is a submartingale. The
submartingale inequality (3.6.1) gives

$$P[\sup_{1 \le k \le n} S_k \ge \epsilon] = P[\sup_{1 \le k \le n} e^{\theta S_k} \ge e^{\theta \epsilon}]
  \le e^{-\theta \epsilon} E[e^{\theta S_n}] = e^{-\theta \epsilon} e^{\theta^2 n/2} .$$

For given ε > 0, we get the best estimate for $\theta = \epsilon/n$ and obtain

$$P[\sup_{1 \le k \le n} S_k \ge \epsilon] \le e^{-\epsilon^2/(2n)} .$$

(ii) Given K > 1 (close to 1). Choose $\epsilon_n = K \Lambda(K^{n-1})$. The last inequality
in (i) gives

$$P[\sup_{1 \le k \le K^n} S_k \ge \epsilon_n] \le \exp(-\epsilon_n^2/(2 K^n)) = (n-1)^{-K} (\log K)^{-K} .$$

The Borel-Cantelli lemma assures that for large enough n and
$K^{n-1} \le k \le K^n$

$$S_k \le \sup_{1 \le k \le K^n} S_k \le \epsilon_n = K \Lambda(K^{n-1}) \le K \Lambda(k)$$

which means for K > 1 almost surely

$$\limsup_{k \to \infty} \frac{S_k}{\Lambda(k)} \le K .$$

By taking a sequence of K's converging down to 1, we obtain almost surely

$$\limsup_{k \to \infty} \frac{S_k}{\Lambda(k)} \le 1 .$$

(iii) Given N > 1 (large) and δ > 0 (small). Define the independent sets

$$A_n = \{S(N^{n+1}) - S(N^n) > (1 - \delta) \Lambda(N^{n+1} - N^n)\} .$$

Then

$$P[A_n] = 1 - \Phi(y) \ge (2\pi)^{-1/2} (y + y^{-1})^{-1} e^{-y^2/2}$$

with $y = (1 - \delta)(2 \log\log(N^{n+1} - N^n))^{1/2}$. Since $P[A_n]$ is up to logarithmic
terms equal to $(n \log N)^{-(1-\delta)^2}$, we have $\sum_n P[A_n] = \infty$. Borel-Cantelli
shows that $P[\limsup_n A_n] = 1$ so that for infinitely many n

$$S(N^{n+1}) > (1 - \delta) \Lambda(N^{n+1} - N^n) + S(N^n) .$$

By (ii), $S(N^n) > -2\Lambda(N^n)$ for large n so that for infinitely many n, we
have

$$S(N^{n+1}) > (1 - \delta) \Lambda(N^{n+1} - N^n) - 2\Lambda(N^n) .$$

It follows that

$$\limsup_n \frac{S_n}{\Lambda_n} \ge \limsup_n \frac{S(N^{n+1})}{\Lambda(N^{n+1})}
  \ge (1 - \delta)(1 - \frac{1}{N})^{1/2} - 2 N^{-1/2} .$$

Letting N → ∞ and δ → 0 gives $\limsup_n S_n/\Lambda_n \ge 1$.

□ 
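The elementary estimates for $1 - \Phi(x)$ used at the start of this proof are
easy to check numerically. The sketch below relies on the standard identity
$1 - \Phi(x) = \frac{1}{2} \mathrm{erfc}(x/\sqrt{2})$ and Python's `math.erfc`; the
helper names are illustrative.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def upper_tail(x):
    """1 - Phi(x), computed from the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

for x in (0.5, 1.0, 2.0, 3.0, 5.0):
    lower = phi(x) / (x + 1 / x)
    upper = phi(x) / x
    assert lower <= upper_tail(x) <= upper
    print(x, lower, upper_tail(x), upper)
```

For large x the two bounds are close to each other, which is what makes them
sharp enough for the Borel-Cantelli arguments in (ii) and (iii).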
3.7 Doob's $\mathcal{L}^p$ inequality

Lemma 3.7.1. (Corollary of Hölder inequality) Fix p > 1 and q satisfying
$p^{-1} + q^{-1} = 1$. Given $X, Y \in \mathcal{L}^p$ satisfying

$$\epsilon P[|X| \ge \epsilon] \le E[|Y|; |X| \ge \epsilon]$$

for all ε > 0, then $||X||_p \le q \cdot ||Y||_p$.

Proof. Integrating the assumption multiplied with $p \epsilon^{p-2}$ gives

$$L = \int_0^\infty p \epsilon^{p-1} P[|X| \ge \epsilon] \, d\epsilon \le \int_0^\infty p \epsilon^{p-2} E[|Y|; |X| \ge \epsilon] \, d\epsilon =: R .$$

By Fubini's theorem, the left hand side is

$$L = \int_0^\infty E[p \epsilon^{p-1} 1_{\{|X| \ge \epsilon\}}] \, d\epsilon = E[\int_0^{|X|} p \epsilon^{p-1} \, d\epsilon] = E[|X|^p] .$$

Similarly, the right hand side is $R = E[q \cdot |X|^{p-1} |Y|]$. With Hölder's
inequality, we get

$$E[|X|^p] \le E[q |X|^{p-1} |Y|] \le q \, ||Y||_p \cdot || |X|^{p-1} ||_q .$$

Since (p - 1)q = p, we can substitute $|| |X|^{p-1} ||_q = E[|X|^p]^{1/q}$ on the right
hand side, which gives the claim. □



Theorem 3.7.2 (Doob's $\mathcal{L}^p$ inequality). Given a non-negative submartingale
X which is bounded in $\mathcal{L}^p$. Then $X^* = \sup_n X_n$ is in $\mathcal{L}^p$ and satisfies

$$||X^*||_p \le q \cdot \sup_n ||X_n||_p .$$

Proof. Define $X_n^* = \sup_{1 \le k \le n} X_k$ for $n \in N$. From Doob's submartingale
inequality (3.6.1) and the above lemma (3.7.1), we see that

$$||X_n^*||_p \le q ||X_n||_p \le q \sup_n ||X_n||_p .$$

Since $X_n^*$ increases to $X^*$, the monotone convergence theorem gives the
claim.

□ 
Corollary 3.7.3. Given a non-negative submartingale X which is bounded
in $\mathcal{L}^p$. Then $X_\infty = \lim_{n \to \infty} X_n$ exists in $\mathcal{L}^p$ and
$||X_\infty||_p = \lim_{n \to \infty} ||X_n||_p$.

Proof. The submartingale X is dominated by the element $X^*$ in the
$\mathcal{L}^p$-inequality. The supermartingale -X is bounded in $\mathcal{L}^p$ and so bounded in
$\mathcal{L}^1$. We know therefore that $X_\infty = \lim_{n \to \infty} X_n$ exists almost everywhere.
From $|X_n - X_\infty|^p \le (2X^*)^p \in \mathcal{L}^1$ and the dominated convergence
theorem (2.4.3) we deduce $X_n \to X_\infty$ in $\mathcal{L}^p$. □

Corollary 3.7.4. Given a martingale Y bounded in $\mathcal{L}^p$ and X = |Y|. Then

$$X_\infty = \lim_{n \to \infty} X_n$$

exists in $\mathcal{L}^p$ and $||X_\infty||_p = \lim_{n \to \infty} ||X_n||_p$.

Proof. Use the above corollary for the submartingale X = |Y|. □



Theorem 3.7.5 (Kakutani's theorem). Let $X_n$ be a non-negative independent $\mathcal{L}^1$ process with $E[X_n] = 1$ for all $n$. Define $S_0 = 1$ and $S_n = \prod_{k=1}^n X_k$. Then $S_\infty = \lim_n S_n$ exists, because $S_n$ is a nonnegative $\mathcal{L}^1$ martingale. Then $S_n$ is uniformly integrable if and only if $\prod_{n=1}^\infty E[X_n^{1/2}] > 0$.



Proof. Define $a_n = E[X_n^{1/2}]$. The process

$$ T_n = \frac{X_1^{1/2}}{a_1}\frac{X_2^{1/2}}{a_2}\cdots\frac{X_n^{1/2}}{a_n} $$

is a martingale. If $\prod_n a_n > 0$, we have $E[T_n^2] = (a_1 a_2 \cdots a_n)^{-2} \leq (\prod_n a_n)^{-2} < \infty$ so that $T$ is bounded in $\mathcal{L}^2$. Because $a_n \leq 1$ by Jensen's inequality, $S_n = (a_1 \cdots a_n)^2 T_n^2 \leq T_n^2$. By Doob's $\mathcal{L}^2$-inequality

$$ E[\sup_n |S_n|] \leq E[\sup_n |T_n|^2] \leq 4\sup_n E[|T_n|^2] < \infty $$

so that $S$ is dominated by $S^* = \sup_n |S_n| \in \mathcal{L}^1$. This implies that $S$ is uniformly integrable.

If $S_n$ is uniformly integrable, then $S_n \to S_\infty$ in $\mathcal{L}^1$. We have to show that $\prod_{n=1}^\infty a_n > 0$. Aiming to a contradiction, we assume that $\prod_n a_n = 0$. The martingale $T$ defined above is a nonnegative martingale which has a limit $T_\infty$. But since $S_n = (a_1 \cdots a_n)^2 T_n^2$ and $\prod_n a_n = 0$, we must then have that $S_\infty = 0$ and so $S_n \to 0$ in $\mathcal{L}^1$. This is not possible because $E[S_n] = 1$ by the independence of the $X_n$. □

Here are examples, where martingales occur in applications: 

Example. This example is a primitive model for the stock and bond market. Given real numbers $a < r < b < \infty$, define $p = (r-a)/(b-a)$. Let $\epsilon_n$ be IID random variables taking the values $1, -1$ with probability $p$ respectively $1-p$. We define a process $B_n$ modeling bonds with fixed interest rate $r$ and a process $S_n$ representing stocks with fluctuating interest rates as follows:

$$ B_n = (1+r)B_{n-1}, \quad B_0 = 1 , $$

$$ S_n = (1+R_n)S_{n-1}, \quad S_0 = 1 , $$

with $R_n = (a+b)/2 + \epsilon_n(b-a)/2$, so that $E[R_n] = pb + (1-p)a = r$. Given a sequence $A_n$, your portfolio, your fortune is $X_n$ and satisfies

$$ X_n = (1+r)X_{n-1} + A_n S_{n-1}(R_n - r) . $$

We can write $R_n - r = \frac{1}{2}(b-a)(Z_n - Z_{n-1})$ with the martingale

$$ Z_n = \sum_{k=1}^n (\epsilon_k - 2p + 1) . $$

The process $Y_n = (1+r)^{-n}X_n$ satisfies then

$$ Y_n - Y_{n-1} = (1+r)^{-n}A_n S_{n-1}(R_n - r) $$

$$ = \frac{1}{2}(b-a)(1+r)^{-n}A_n S_{n-1}(Z_n - Z_{n-1}) $$

$$ = C_n(Z_n - Z_{n-1}) $$

showing that $Y$ is the stochastic integral $\int C\, dZ$. So, if the portfolio $A_n$ is previsible, which means by definition that it is $\mathcal{A}_{n-1}$-measurable, then $Y$ is a martingale.

Example. Let $X, X_1, X_2, \ldots$ be independent random variables satisfying that the law of $X$ is $N(0,\sigma^2)$ and the law of $X_k$ is $N(0,\sigma_k^2)$. We define the random variables

$$ Y_k = X + X_k $$

which we consider as a noisy observation of the random variable $X$. Define $\mathcal{A}_n = \sigma(Y_1, \ldots, Y_n)$ and the martingale

$$ M_n = E[X|\mathcal{A}_n] . $$

By Doob's martingale convergence theorem (3.5.4), we know that $M_n$ converges in $\mathcal{L}^2$ to a random variable $M_\infty$. One can show that

$$ E[(X - M_n)^2] = \Big(\sigma^{-2} + \sum_{k=1}^n \sigma_k^{-2}\Big)^{-1} . $$

This implies that $X = M_\infty$ if and only if $\sum_n \sigma_n^{-2} = \infty$. If the noise grows too much, for example for $\sigma_n = n$, then we can not recover $X$ from the observations $Y_k$.
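The dichotomy in this example can be made concrete by evaluating the mean squared error $(\sigma^{-2} + \sum_{k=1}^n \sigma_k^{-2})^{-1}$ for two noise profiles. This is a minimal sketch; the function name `mean_squared_error` and the choice $\sigma = 1$ are assumptions made for illustration.

```python
def mean_squared_error(sigma, noise_sigmas):
    """E[(X - M_n)^2] = (sigma^-2 + sum_k sigma_k^-2)^-1 for the given noise levels."""
    precision = sigma ** -2 + sum(s ** -2 for s in noise_sigmas)
    return 1.0 / precision

for n in (10, 1000, 100000):
    bounded = mean_squared_error(1.0, [1.0] * n)                            # sigma_k = 1
    growing = mean_squared_error(1.0, [float(k) for k in range(1, n + 1)])  # sigma_k = k
    print(n, bounded, growing)
```

With constant noise the error equals $1/(1+n)$ and tends to zero. With $\sigma_k = k$ the sum $\sum_k k^{-2}$ converges, so the error stays bounded away from zero.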
3.8 Random walks

We consider the $d$-dimensional lattice $\mathbb{Z}^d$ where each point has $2d$ neighbors. A walker starts at the origin $0 \in \mathbb{Z}^d$ and makes in each time step a random step into one of the $2d$ directions. What is the probability that the walker returns back to the origin?

Definition. Define a sequence of IID random variables $X_n$ which take values in

$$ I = \{ e \in \mathbb{Z}^d \mid |e| = \sum_{i=1}^d |e_i| = 1 \} $$

and which have the uniform distribution defined by $P[X_n = e] = (2d)^{-1}$ for all $e \in I$. The random variable $S_n = \sum_{i=1}^n X_i$ with $S_0 = 0$ describes the position of the walker at time $n$. The discrete stochastic process $S_n$ is called the random walk on the lattice $\mathbb{Z}^d$.



Figure. A random walk sample path $S_1(\omega), \ldots, S_n(\omega)$ in the lattice $\mathbb{Z}^2$ after 2000 steps. $B_n(\omega)$ is the number of revisits of the starting point $0$.




As a probability space, we can take $\Omega = I^{\mathbb{N}}$ with product measure $\nu^{\mathbb{N}}$, where $\nu$ is the measure on $I$ which assigns to each point $e$ the probability $\nu(\{e\}) = (2d)^{-1}$. The random variables $X_n$ are then defined by $X_n(\omega) = \omega_n$. Define the sets $A_n = \{S_n = 0\}$ and the random variables

$$ Y_n = 1_{A_n} . $$

If the walker has returned to position $0 \in \mathbb{Z}^d$ at time $n$, then $Y_n = 1$, otherwise $Y_n = 0$. The sum $B_n = \sum_{k=0}^n Y_k$ counts the number of visits of the origin of the walker up to time $n$ and $B = \sum_{k=0}^\infty Y_k$ counts the total number of visits at the origin. The expectation

$$ E[B] = \sum_{n=0}^\infty P[S_n = 0] $$

tells us how many times the walker is expected to return to the origin. We write $E[B] = \infty$ if the sum diverges. In this case, the walker returns back to the origin infinitely many times.
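Theorem 3.8.1 (Polya). $E[B] = \infty$ for $d = 1, 2$ and $E[B] < \infty$ for $d > 2$.



Proof. Fix $n \in \mathbb{N}$ and define $a^{(n)}(k) = P[S_n = k]$ for $k \in \mathbb{Z}^d$. Because the walker can reach in time $n$ only a bounded region, the function $a^{(n)}: \mathbb{Z}^d \to \mathbb{R}$ is zero outside a bounded set. We can therefore define its Fourier transform

$$ \phi_{S_n}(x) = \sum_{k \in \mathbb{Z}^d} a^{(n)}(k) e^{2\pi i k \cdot x} $$

which is a smooth function on $\mathbb{T}^d = \mathbb{R}^d/\mathbb{Z}^d$. It is the characteristic function of $S_n$ because

$$ E[e^{2\pi i x \cdot S_n}] = \sum_{k \in \mathbb{Z}^d} P[S_n = k] e^{2\pi i k \cdot x} . $$

The characteristic function $\phi_X$ of $X_k$ is

$$ \phi_X(x) = \frac{1}{2d}\sum_{|j|=1} e^{2\pi i x \cdot j} = \frac{1}{d}\sum_{i=1}^d \cos(2\pi x_i) . $$

Because $S_n$ is a sum of $n$ independent random variables $X_j$,

$$ \phi_{S_n}(x) = \phi_{X_1}(x)\phi_{X_2}(x)\cdots\phi_{X_n}(x) = \frac{1}{d^n}\Big(\sum_{i=1}^d \cos(2\pi x_i)\Big)^n . $$

Note that the zeroth Fourier coefficient $\hat{\phi}_{S_n}(0) = \int_{\mathbb{T}^d}\phi_{S_n}(x)\, dx$ is $P[S_n = 0]$.

We now show that $E[B] = \sum_{n \geq 0}\hat{\phi}_{S_n}(0)$ is finite if and only if $d \geq 3$. The Fourier inversion formula using the normalized volume measure $dx$ on $\mathbb{T}^d$ gives

$$ \sum_n P[S_n = 0] = \int_{\mathbb{T}^d}\sum_{n=0}^\infty \phi_X^n(x)\, dx = \int_{\mathbb{T}^d}\frac{1}{1-\phi_X(x)}\, dx . $$

A Taylor expansion $\phi_X(x) = 1 - \frac{1}{d}\sum_j \frac{x_j^2}{2}(2\pi)^2 + \cdots$ shows that $1 - \phi_X(x)$ is bounded above and below by constant multiples of $|x|^2$ near $x = 0$, the only zero of $1 - \phi_X$ on $\mathbb{T}^d$. The claim of the theorem follows because the integral

$$ \int_{\{|x| < \epsilon\}} \frac{1}{|x|^2}\, dx $$

over the ball of radius $\epsilon$ in $\mathbb{R}^d$ is finite if and only if $d \geq 3$. □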
Corollary 3.8.2. The walker returns to the origin infinitely often almost surely if $d \leq 2$. For $d \geq 3$, almost surely, the walker or rather bird returns only finitely many times to zero and $P[\lim_{n\to\infty}|S_n| = \infty] = 1$.



Proof. If $d > 2$, then $A_\infty = \limsup_n A_n$ is the subset of $\Omega$ for which the particle returns to $0$ infinitely many times. Since $E[B] = \sum_{n=0}^\infty P[A_n] < \infty$, the Borel-Cantelli lemma gives $P[A_\infty] = 0$ for $d > 2$. The particle returns therefore back to $0$ only finitely many times and in the same way it visits each lattice point only finitely many times. This means that the particle eventually leaves every bounded set and converges to infinity.
If $d \leq 2$, let $p$ be the probability that the random walk returns to $0$:

$$ p = P[\bigcup_{n \geq 1} A_n] . $$

Then $p^{m-1}$ is the probability that there are at least $m$ visits in $0$ and the probability is $p^{m-1} - p^m = p^{m-1}(1-p)$ that there are exactly $m$ visits. We can write

$$ E[B] = \sum_{m \geq 1} m p^{m-1}(1-p) = \frac{1}{1-p} . $$

Because $E[B] = \infty$, we know that $p = 1$. □
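The dichotomy between $d \leq 2$ and $d \geq 3$ can be observed in exact partial sums of $\sum_n P[S_{2n} = 0]$. The sketch below counts closed paths with a standard combinatorial recurrence that is not taken from the text: of the $2n$ steps, choose the $2j$ steps that move along the first coordinate, which must form a closed path in $\mathbb{Z}$, while the remaining steps form a closed path in $\mathbb{Z}^{d-1}$. The function names are hypothetical.

```python
from math import comb

def closed_path_counts(d, n_max):
    """counts[n] = number of closed paths of length 2n from the origin in Z^d."""
    counts = [comb(2 * n, n) for n in range(n_max + 1)]          # case d = 1
    for _ in range(d - 1):
        one_dim = [comb(2 * j, j) for j in range(n_max + 1)]
        counts = [
            sum(comb(2 * n, 2 * j) * one_dim[j] * counts[n - j] for j in range(n + 1))
            for n in range(n_max + 1)
        ]
    return counts

def expected_visits(d, n_max):
    """Partial sum of P[S_2n = 0] for n = 0..n_max, a lower bound for E[B]."""
    counts = closed_path_counts(d, n_max)
    return sum(c / (2 * d) ** (2 * n) for n, c in enumerate(counts))

for d in (1, 2, 3):
    print(d, [round(expected_visits(d, n), 3) for n in (10, 40, 160)])
```

For $d = 1$ and $d = 2$ the partial sums keep growing as more terms are added, while for $d = 3$ they level off, in line with the theorem.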

The use of characteristic functions also allows us to solve combinatorial problems, like counting the number of closed paths starting at zero in the graph:



Proposition 3.8.3. There are

$$ (2d)^n \int_{\mathbb{T}^d}\Big(\frac{1}{d}\sum_{k=1}^d \cos(2\pi x_k)\Big)^n dx_1 \cdots dx_d $$

closed paths of length $n$ which start at the origin in the lattice $\mathbb{Z}^d$.



Proof. If we know the probability $P[S_n = 0]$ that a path returns to $0$ in $n$ steps, then $(2d)^n P[S_n = 0]$ is the number of closed paths in $\mathbb{Z}^d$ of length $n$. But $P[S_n = 0]$ is the zeroth Fourier coefficient

$$ \int_{\mathbb{T}^d}\phi_{S_n}(x)\, dx = \int_{\mathbb{T}^d}\Big(\frac{1}{d}\sum_{k=1}^d \cos(2\pi x_k)\Big)^n dx $$

of $\phi_{S_n}$, where $dx = dx_1 \cdots dx_d$. □
Example. In the case $d = 1$, we have

$$ \int_0^1 2^{2n}\cos^{2n}(2\pi x)\, dx = \binom{2n}{n} $$

closed paths of length $2n$ starting at $0$. We know that also because

$$ P[S_{2n} = 0] = \binom{2n}{n}\frac{1}{2^{2n}} . $$

For paths of length $2$ for example, we have $\int_0^1 2^2\cos(2\pi x)^2\, dx = 2$ closed paths which start at $0$ in $\mathbb{Z}$.
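The integral formula for $d = 1$ can be tested numerically. The sketch below evaluates $\int_0^1 (2\cos(2\pi x))^{2n}\, dx$ by an equally spaced sum, which is exact up to rounding for a trigonometric polynomial once the number of sample points exceeds its highest frequency $2n$. The function name is hypothetical.

```python
from math import comb, cos, pi

def closed_paths_by_integral(n):
    """Integral of (2 cos(2 pi x))^(2n) over [0, 1], computed by an equally spaced sum."""
    m = 2 * n + 1                      # more sample points than the highest frequency 2n
    return sum((2 * cos(2 * pi * k / m)) ** (2 * n) for k in range(m)) / m

for n in range(1, 6):
    print(n, round(closed_paths_by_integral(n)), comb(2 * n, n))
```

Both columns should agree, since the integral counts the closed paths of length $2n$ and so does the binomial coefficient.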

The lattice $\mathbb{Z}^d$ can be generalized to an arbitrary regular graph $G$, that is a graph where each vertex has the same number of neighbors. A convenient way is to take as the graph the Cayley graph of a discrete group $\mathcal{G}$ with generators $a_1, \ldots, a_d$. The random walk can also be studied on a general graph. If the degree is $d$ at a point $x$, then the walker chooses a random direction with probability $1/d$.




Corollary 3.8.4. If $G$ is the Cayley graph of an Abelian group $\mathcal{G}$ then the random walk on $G$ is recurrent if and only if at most two of the generators have infinite order.



Proof. By the structure theorem for Abelian groups, an Abelian group $\mathcal{G}$ is isomorphic to $\mathbb{Z}^k \times \mathbb{Z}_{n_1} \times \cdots \times \mathbb{Z}_{n_d}$. The characteristic function of $X_n$ is a function on the dual group $\hat{\mathcal{G}}$. The sum

$$ \sum_{n=0}^\infty P[S_n = 0] = \sum_{n=0}^\infty \int_{\hat{\mathcal{G}}}\phi_{S_n}(x)\, dx = \sum_{n=0}^\infty \int_{\hat{\mathcal{G}}}\phi_X^n(x)\, dx = \int_{\hat{\mathcal{G}}}\frac{1}{1-\phi_X(x)}\, dx $$

is finite if and only if $\hat{\mathcal{G}}$ contains a three dimensional torus which means $k > 2$. □

The recurrence properties on non-Abelian groups are more subtle, because characteristic functions then lose some of their good properties.



Example. Another generalization is to add a drift by changing the probability distribution $\nu$ on $I$. Given $p_j \in (0,1)$ with $\sum_{|j|=1} p_j = 1$. In this case

$$ \phi_X(x) = \sum_{|j|=1} p_j e^{2\pi i x \cdot j} . $$

We have recurrence if and only if

$$ \int_{\mathbb{T}^d}\frac{1}{1-\phi_X(x)}\, dx = \infty . $$

Take for example the case $d = 1$ with drift parameterized by $p \in (0,1)$. Then

$$ \phi_X(x) = pe^{2\pi i x} + (1-p)e^{-2\pi i x} = \cos(2\pi x) + i(2p-1)\sin(2\pi x) , $$

which shows that

$$ \int_{\mathbb{T}}\frac{1}{1-\phi_X(x)}\, dx < \infty $$

if $p \neq 1/2$. A random walk with drift on $\mathbb{Z}^d$ will almost certainly not return to $0$ infinitely often.

Example. Another generalization of the random walk is to take identically distributed random variables $X_n$ with values in $I$, which need not be independent. An example which appears in number theory in the case $d = 1$ is to take the probability space $\Omega = \mathbb{T}^1 = \mathbb{R}/\mathbb{Z}$, an irrational number $\alpha$ and a function $f$ which takes each value in $I$ on an interval $[\frac{k}{2d}, \frac{k+1}{2d})$. The random variables $X_n(\omega) = f(\omega + n\alpha)$ define an ergodic discrete stochastic process but the random variables are not independent. A random walk $S_n = \sum_{k=1}^n X_k$ with random variables $X_k$ which are dependent is called a dependent random walk.



Figure. If $Y_k$ are IID random variables with uniform distribution in $[0,a]$, then $Z_n = \sum_{k=1}^n Y_k \mod 1$ are dependent. Define $X_k = (1,0)$ if $Z_k \in [0,1/4)$, $X_k = (-1,0)$ if $Z_k \in [1/4,1/2)$, $X_k = (0,1)$ if $Z_k \in [1/2,3/4)$ and $X_k = (0,-1)$ if $Z_k \in [3/4,1)$. Also the $X_k$ are no longer independent. For small $a$, there can be long intervals where $X_k$ is the same because $Z_k$ stays in the same quarter interval. The picture shows a typical path of the process $S_n = \sum_{k=1}^n X_k$.

Example. An example of a one-dimensional dependent random walk is the problem of "almost alternating sums" [53]. Define on the probability space $\Omega = ([0,1], \mathcal{A}, dx)$ the random variables $X_n(x) = 2 \cdot 1_{[0,1/2]}(x + n\alpha) - 1$, where $\alpha$ is an irrational number. This produces a symmetric random walk, but unlike for the usual random walk, where $S_n(x)$ grows like $\sqrt{n}$, one sees a much slower growth $S_n(0) \leq \log(n)^2$ for almost all $\alpha$ and for special numbers like the golden ratio $(\sqrt{5}+1)/2$ or the silver ratio $\sqrt{2}+1$ one has for infinitely many $n$ the relation

$$ a \cdot \log(n) + 0.78 \leq S_n(0) \leq a \cdot \log(n) + 1 $$

with $a = 1/(2\log(1+\sqrt{2}))$. It is not known whether $S_n(0)$ grows like $\log(n)$ for almost all $\alpha$.
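The slow growth of these sums is easy to observe numerically. The following sketch evaluates $S_n(0) = \sum_{k=1}^n (2 \cdot 1_{[0,1/2]}(k\alpha \bmod 1) - 1)$ for the golden ratio in floating point arithmetic, which is an assumption that is harmless for moderate $n$. The function name is hypothetical.

```python
from math import log, sqrt

def almost_alternating_sum(alpha, n):
    """S_n(0) = sum over k = 1..n of 2 * 1_[0,1/2]((k * alpha) mod 1) - 1."""
    total = 0
    for k in range(1, n + 1):
        total += 1 if (k * alpha) % 1.0 <= 0.5 else -1
    return total

golden = (sqrt(5) + 1) / 2
for n in (10, 100, 1000, 10000):
    print(n, almost_alternating_sum(golden, n), round(log(n) ** 2, 1), round(sqrt(n), 1))
```

The printed sums can be compared with the two reference scales $\log(n)^2$ and $\sqrt{n}$ shown next to them.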



Figure. An almost periodic random walk in one dimension. Instead of flipping coins to decide whether to go up or down, one turns a wheel by an angle $\alpha$ after each step and goes up if the wheel position is in the right half and goes down if the wheel position is in the left half. While for periodic $\alpha$ the growth of $S_n$ is either linear (like for $\alpha = 0$), or zero (like for $\alpha = 1/2$), the growth for most irrational $\alpha$ seems to be logarithmic.




3.9 The arc-sin law for the 1D random walk

Definition. Let $X_n$ denote independent $\{-1,1\}$-valued random variables with $P[X_n = \pm 1] = 1/2$ and let $S_n = \sum_{k=1}^n X_k$ be the random walk. We have seen that it is a martingale with respect to $X_n$. Given $a \in \mathbb{Z}$, we define the stopping time

$$ T_a = \min\{ n \in \mathbb{N} \mid S_n = a \} . $$



Theorem 3.9.1 (Reflection principle). For integers $a, b > 0$, one has

$$ P[a + S_n = b, T_{-a} \leq n] = P[S_n = a + b] . $$



Proof. The number of paths from $a$ to $b$ passing zero is equal to the number of paths from $-a$ to $b$ which in turn is the number of paths from zero to $a + b$. □






Figure. The proof of the reflection principle: reflect the part of the path above $0$ at the line $0$. To every path which goes from $a$ to $b$ and touches $0$ there corresponds a path from $-a$ to $b$.




The reflection principle allows us to compute the distribution of the random variable $T_{-a}$:



Theorem 3.9.2 (Ruin time). We have the following distribution of the stopping time:

a) $P[T_{-a} \leq n] = P[S_n \leq -a] + P[S_n > a]$.

b) $P[T_{-a} = n] = \frac{a}{n}P[S_n = a]$.



Proof. a) Use the reflection principle in the third equality:

$$ P[T_{-a} \leq n] = \sum_{b \in \mathbb{Z}} P[T_{-a} \leq n, a + S_n = b] $$

$$ = \sum_{b \leq 0} P[a + S_n = b] + \sum_{b > 0} P[T_{-a} \leq n, a + S_n = b] $$

$$ = \sum_{b \leq 0} P[a + S_n = b] + \sum_{b > 0} P[S_n = a + b] $$

$$ = P[S_n \leq -a] + P[S_n > a] . $$

b) From

$$ P[S_n = a] = \binom{n}{\frac{a+n}{2}}\frac{1}{2^n} $$

we get

$$ \frac{a}{n}P[S_n = a] = \frac{1}{2}(P[S_{n-1} = a-1] - P[S_{n-1} = a+1]) . $$

Also

$$ P[S_n > a] - P[S_{n-1} > a] = P[S_n > a, S_{n-1} \leq a] + P[S_n > a, S_{n-1} > a] - P[S_{n-1} > a] $$

$$ = \frac{1}{2}(P[S_{n-1} = a] - P[S_{n-1} = a+1]) $$

and analogously

$$ P[S_n \leq -a] - P[S_{n-1} \leq -a] = \frac{1}{2}(P[S_{n-1} = a-1] - P[S_{n-1} = a]) . $$

Therefore, using a)

$$ P[T_{-a} = n] = P[T_{-a} \leq n] - P[T_{-a} \leq n-1] $$

$$ = P[S_n \leq -a] - P[S_{n-1} \leq -a] + P[S_n > a] - P[S_{n-1} > a] $$

$$ = \frac{1}{2}(P[S_{n-1} = a] - P[S_{n-1} = a+1]) + \frac{1}{2}(P[S_{n-1} = a-1] - P[S_{n-1} = a]) $$

$$ = \frac{1}{2}(P[S_{n-1} = a-1] - P[S_{n-1} = a+1]) = \frac{a}{n}P[S_n = a] . $$

□ 
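Part b) of the ruin time theorem can be verified exactly for small $n$. The sketch below propagates the distribution of the walk while removing the mass that reaches $-a$, and compares the result with $\frac{a}{n}P[S_n = a]$. The function names are hypothetical, and $a \geq 1$, $n \geq 1$ are assumed.

```python
from fractions import Fraction
from math import comb

def first_hit_probability(a, n):
    """P[T_{-a} = n]: the walk reaches -a for the first time at step n (a, n >= 1)."""
    alive = {0: Fraction(1)}                     # positions of paths that have not hit -a
    for step in range(1, n + 1):
        nxt = {}
        for pos, prob in alive.items():
            for move in (-1, 1):
                nxt[pos + move] = nxt.get(pos + move, Fraction(0)) + prob / 2
        hit = nxt.pop(-a, Fraction(0))           # mass arriving at -a for the first time
        alive = nxt
        if step == n:
            return hit

def ruin_formula(a, n):
    """(a/n) * P[S_n = a] with P[S_n = a] = C(n, (n+a)/2) / 2^n."""
    if (n + a) % 2 or a > n:
        return Fraction(0)
    return Fraction(a, n) * Fraction(comb(n, (n + a) // 2), 2 ** n)

for a, n in [(1, 3), (2, 6), (3, 9)]:
    print(a, n, first_hit_probability(a, n), ruin_formula(a, n))
```

Exact rational arithmetic is used so that the two columns can be compared for equality rather than up to rounding.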

Theorem 3.9.3 (Ballot theorem).

$$ P[S_n = a, S_1 > 0, \ldots, S_{n-1} > 0] = \frac{a}{n} \cdot P[S_n = a] . $$



Proof. When reversing time, the number of paths from $0$ to $a$ of length $n$ which no longer hit $0$ is the number of paths of length $n$ which start in $a$ and for which $T_{-a} = n$. Now use the previous theorem

$$ P[T_{-a} = n] = \frac{a}{n}P[S_n = a] . $$

□ 



Corollary 3.9.4. The distribution of the first return time is

$$ P[T_0 > 2n] = P[S_{2n} = 0] . $$



Proof.

$$ P[T_0 > 2n] = \frac{1}{2}P[T_{-1} > 2n-1] + \frac{1}{2}P[T_1 > 2n-1] $$

$$ = P[T_{-1} > 2n-1] \quad \text{(by symmetry)} $$

$$ = P[S_{2n-1} > -1 \ \text{and} \ S_{2n-1} \leq 1] $$

$$ = P[S_{2n-1} \in \{0,1\}] $$

$$ = P[S_{2n-1} = 1] = P[S_{2n} = 0] . $$

□

Remark. We see that $\lim_{n\to\infty} P[T_0 > 2n] = 0$. This restates that the random walk is recurrent. However, the expected return time is very long:

$$ E[T_0] = \sum_{n=0}^\infty nP[T_0 = n] = \sum_{n=0}^\infty P[T_0 > n] \geq \sum_{n=0}^\infty P[T_0 > 2n] = \sum_{n=0}^\infty P[S_{2n} = 0] = \infty $$

because by the Stirling formula $n! \sim n^n e^{-n}\sqrt{2\pi n}$, one has $\binom{2n}{n} \sim 2^{2n}/\sqrt{\pi n}$ and so

$$ P[S_{2n} = 0] = \binom{2n}{n}\frac{1}{2^{2n}} \sim (\pi n)^{-1/2} . $$

Definition. We are interested now in the random variable

$$ L(\omega) = \max\{ 0 \leq n \leq 2N \mid S_n(\omega) = 0 \} $$

which describes the last visit of the random walk in $0$ before time $2N$. If the random walk describes a game between two players, who play over a time $2N$, then $L$ is the time after which one of the two players no longer gives up his leadership.




Theorem 3.9.5 (Arc Sin law). $L$ has the discrete arc-sin distribution:

$$ P[L = 2n] = \frac{1}{2^{2N}}\binom{2n}{n}\binom{2N-2n}{N-n} $$

and for $N \to \infty$, we have

$$ P[\frac{L}{2N} \leq z] \to \frac{2}{\pi}\arcsin(\sqrt{z}) . $$



Proof.

$$ P[L = 2n] = P[S_{2n} = 0] \cdot P[T_0 > 2N - 2n] = P[S_{2n} = 0] \cdot P[S_{2N-2n} = 0] $$

which gives the first formula. The Stirling formula gives $P[S_{2k} = 0] \sim \frac{1}{\sqrt{\pi k}}$ so that

$$ P[L = 2k] \sim \frac{1}{\pi}\frac{1}{\sqrt{k(N-k)}} = \frac{1}{N}f(\frac{k}{N}) $$

with

$$ f(x) = \frac{1}{\pi\sqrt{x(1-x)}} . $$

It follows that

$$ P[\frac{L}{2N} \leq z] \to \int_0^z f(x)\, dx = \frac{2}{\pi}\arcsin(\sqrt{z}) . $$

□ 
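The convergence to the arc-sin law can be watched numerically. The sketch below tabulates the discrete distribution $P[L = 2n] = u_n u_{N-n}$ with $u_k = P[S_{2k} = 0] = \binom{2k}{k}4^{-k}$ and compares its distribution function with $\frac{2}{\pi}\arcsin(\sqrt{z})$. The function names and the choice $N = 500$ are assumptions made for illustration.

```python
from math import asin, comb, pi, sqrt

def last_visit_distribution(N):
    """List of P[L = 2n] for n = 0..N, using u_k = P[S_2k = 0] = C(2k, k) / 4^k."""
    u = [comb(2 * k, k) / 4 ** k for k in range(N + 1)]
    return [u[n] * u[N - n] for n in range(N + 1)]

def arcsin_cdf(z):
    """Limiting distribution function (2 / pi) * arcsin(sqrt(z))."""
    return 2 / pi * asin(sqrt(z))

N = 500
dist = last_visit_distribution(N)
for z in (0.1, 0.25, 0.5, 0.9):
    discrete = sum(dist[: int(z * N) + 1])       # P[L / 2N <= z]
    print(z, round(discrete, 4), round(arcsin_cdf(z), 4))
```

The distribution puts most of its mass near $z = 0$ and $z = 1$, which is the U-shape of the arc-sin density.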




Figure. The distribution function $P[L/2N \leq z]$ converges in the limit $N \to \infty$ to the function $2\arcsin(\sqrt{z})/\pi$.



Figure. The density function of this distribution in the limit $N \to \infty$ is called the arc-sin distribution.



Remark. From the shape of the arc-sin distribution, one has to expect that the winner takes the final leading position either early or late.

Remark. The arc-sin distribution is a natural distribution on the interval $[0,1]$ from different points of view. It belongs to a measure which is the Gibbs measure of the quadratic map $x \mapsto 4 \cdot x(1-x)$ on the unit interval maximizing the Boltzmann-Gibbs entropy. It is a thermodynamic equilibrium measure for this quadratic map. It is the measure $\mu$ on the interval $[0,1]$ which minimizes the energy

$$ I(\mu) = -\int_0^1\int_0^1 \log|E - E'|\, d\mu(E)\, d\mu(E') . $$

One calls such measures also potential theoretical equilibrium measures.



3.10 The random walk on the free group 

Definition. The free group $F_d$ with $d$ generators is the set of finite words $w$ written in the $2d$ letters

$$ A = \{a_1, a_2, \ldots, a_d, a_1^{-1}, a_2^{-1}, \ldots, a_d^{-1}\} $$

modulo the identifications $a_i a_i^{-1} = a_i^{-1}a_i = 1$. The group operation is concatenating words $v \circ w = vw$. The inverse of $w = w_1 w_2 \cdots w_n$ is $w^{-1} = w_n^{-1} \cdots w_2^{-1}w_1^{-1}$. Elements $w$ in the group $F_d$ can be uniquely represented by reduced words obtained by deleting all words $vv^{-1}$ in $w$. The identity $e$ in the group $F_d$ is the empty word. We denote by $l(w)$ the length of the reduced word of $w$.

Definition. Given a free group $G$ with generators $A$, let $X_k$ be uniformly distributed random variables with values in $A$. The stochastic process $S_n = X_1 \cdots X_n$ is called the random walk on the group $G$. Note that the group operation need not be commutative. The random walk on the free group can be interpreted as a walk on a tree, because the Cayley graph of the group $F_d$ with generators $A$ contains no non-contractible closed circles.



Figure. Part of the Cayley graph of the free group $F_2$ with two generators $a, b$. It is a tree. At every point, one can go into 4 different directions. Going into one of these directions corresponds to multiplying with $a$, $a^{-1}$, $b$ or $b^{-1}$.

Definition. Define for $n \in \mathbb{N}$

$$ r_n = P[S_n = e, S_1 \neq e, S_2 \neq e, \ldots, S_{n-1} \neq e] $$

which is the probability of returning for the first time to $e$ if one starts at $e$. Define also for $n \in \mathbb{N}$

$$ m_n = P[S_n = e] $$

with the convention $m_0 = 1$. Let $r$ and $m$ be the probability generating functions of the sequences $r_n$ and $m_n$:

$$ m(x) = \sum_{n=0}^\infty m_n x^n, \quad r(x) = \sum_{n=0}^\infty r_n x^n . $$

These sums converge for $|x| < 1$.



Lemma 3.10.1. (Feller)

$$ m(x) = \frac{1}{1-r(x)} . $$

Proof. Let $T$ be the stopping time

$$ T = \min\{ n \in \mathbb{N} \mid S_n = e \} . $$

With $P[T = n] = r_n$, the function $r(x) = \sum_{n=1}^\infty r_n x^n$ is the probability generating function of $T$. The probability generating function of a sum of independent random variables is the product of the probability generating functions. Therefore, if $T_i$ are independent random variables with distribution $T$, then $\sum_{i=1}^k T_i$ has the probability generating function $x \mapsto r^k(x)$. We have

$$ \sum_{n=0}^\infty m_n x^n = \sum_{n=0}^\infty P[S_n = e]x^n $$

$$ = \sum_{n=0}^\infty \sum_{0 < n_1 < n_2 < \cdots < n_k = n} P[S_{n_1} = e, S_{n_2} = e, \ldots, S_{n_k} = e, S_j \neq e \ \text{for} \ j \in \{1,\ldots,n\} \setminus \{n_1,\ldots,n_k\}]\, x^n $$

$$ = \sum_{n=0}^\infty \sum_{k=0}^n P[\sum_{i=1}^k T_i = n]\, x^n = \sum_{k=0}^\infty r^k(x) = \frac{1}{1-r(x)} . $$

□

Remark. This lemma is true for the random walk on a Cayley graph of any finitely presented group.

The numbers $r_{2n+1}$ are zero for odd $2n+1$ because an even number of steps are needed to come back. The values of $r_{2n}$ can be computed by using basic combinatorics:



Lemma 3.10.2. (Kesten)

$$ r_{2n} = \frac{1}{(2d)^{2n}}\frac{1}{n}\binom{2n-2}{n-1}2d(2d-1)^{n-1} . $$



Proof. We have

$$ r_{2n} = \frac{1}{(2d)^{2n}}|\{ w_1 w_2 \cdots w_{2n} = e \mid w^k = w_1 w_2 \ldots w_k \neq e \ \text{for} \ 0 < k < 2n \}| . $$

To count the number of such words, map every word with $2n$ letters into a path in $\mathbb{Z}^2$ going from $(0,0)$ to $(n,n)$ which is away from the diagonal except at the beginning or the end. The map is constructed in the following way: for every letter, we record a horizontal or vertical step of length 1. If $l(w^k) = l(w^{k-1}) + 1$, we record a horizontal step. In the other case, if $l(w^k) = l(w^{k-1}) - 1$, we record a vertical step. The first step is horizontal independent of the word. There are

$$ \frac{1}{n}\binom{2n-2}{n-1} $$

such paths since by the distribution of the stopping time in the one dimensional random walk

$$ P[T_{-1} = 2n-1] = \frac{1}{2n-1}P[S_{2n-1} = 1] = \frac{1}{2n-1}\binom{2n-1}{n}\frac{1}{2^{2n-1}} = \frac{1}{n}\binom{2n-2}{n-1}\frac{1}{2^{2n-1}} $$

and each path of length $2n-1$ has probability $2^{-(2n-1)}$.

Counting the number of words which are mapped into the same path, we see that we have in the first step $2d$ possibilities and later $(2d-1)$ possibilities in each of the $n-1$ horizontal steps and only 1 possibility in a vertical step. We have therefore to multiply the number of paths by $2d(2d-1)^{n-1}$. □



Theorem 3.10.3 (Kesten). For the free group $F_d$, we have

$$ m(x) = \frac{2d-1}{(d-1) + \sqrt{d^2 - (2d-1)x^2}} . $$



Proof. Since we know the terms $r_{2n}$ we can compute

$$ r(x) = \frac{d - \sqrt{d^2 - (2d-1)x^2}}{2d-1} $$

and get the claim with Feller's lemma $m(x) = 1/(1 - r(x))$. □
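The formulas of this section can be cross-checked numerically. The sketch below assumes the formulas $r_{2n} = (2d)^{-2n}\frac{1}{n}\binom{2n-2}{n-1}2d(2d-1)^{n-1}$ and $m(x) = (2d-1)/((d-1) + \sqrt{d^2 - (2d-1)x^2})$, and uses the observation that the word length $l(S_n)$ is a Markov chain on $\{0,1,2,\ldots\}$: from $0$ it moves to $1$, and from $k \geq 1$ it moves to $k+1$ with probability $(2d-1)/(2d)$ and to $k-1$ with probability $1/(2d)$. All function names are hypothetical.

```python
from math import comb, sqrt

def free_group_return_data(d, steps):
    """Return (m, r): m[n] = P[S_n = e], r[n] = P[first return to e at time n]."""
    up, down = (2 * d - 1) / (2 * d), 1 / (2 * d)    # word length grows / shrinks

    def evolve(absorb_at_identity):
        dist = {0: 1.0}                              # distribution of the word length l(S_n)
        out = [0.0 if absorb_at_identity else 1.0]   # r_0 = 0, m_0 = 1
        for _ in range(steps):
            nxt = {}
            for length, prob in dist.items():
                if length == 0:
                    nxt[1] = nxt.get(1, 0.0) + prob  # from e every letter lengthens the word
                else:
                    nxt[length + 1] = nxt.get(length + 1, 0.0) + prob * up
                    nxt[length - 1] = nxt.get(length - 1, 0.0) + prob * down
            out.append(nxt.get(0, 0.0))
            if absorb_at_identity:
                nxt.pop(0, None)                     # stop paths that have already returned
            dist = nxt
        return out

    return evolve(False), evolve(True)

def kesten_r(d, n):
    """Closed formula for r_{2n}."""
    return comb(2 * n - 2, n - 1) / n * 2 * d * (2 * d - 1) ** (n - 1) / (2 * d) ** (2 * n)

def kesten_m(d, x):
    """Closed formula for m(x)."""
    return (2 * d - 1) / ((d - 1) + sqrt(d * d - (2 * d - 1) * x * x))

m, r = free_group_return_data(2, 400)
print(sum(m), kesten_m(2, 1.0), 1 / (1 - sum(r)))
print(r[4], kesten_r(2, 2))
```

For $d = 2$ the three numbers in the first line should be close to each other, which illustrates both the closed formula for $m(1)$ and Feller's relation $m = 1/(1-r)$ at $x = 1$.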

Remark. The Cayley graph of the free group is also called the Bethe lattice. One can read off from this formula that the spectrum of the free Laplacian $L: l^2(F_d) \to l^2(F_d)$ on the Bethe lattice given by

$$ Lu(g) = \sum_{a \in A} u(g + a) $$

is the whole interval $[-a, a]$ with $a = 2\sqrt{2d-1}$.



Corollary 3.10.4. The random walk on the free group $F_d$ with $d$ generators is recurrent if and only if $d = 1$.



Proof. Denote as in the case of the random walk on $\mathbb{Z}^d$ with $B$ the random variable counting the total number of visits of the origin. We have then again $E[B] = \sum_n P[S_n = e] = \sum_n m_n = m(1)$. We see that for $d = 1$ we have $m(1) = \infty$ and that $m(1) < \infty$ for $d > 1$. This establishes the analog of Polya's result on $\mathbb{Z}^d$ and leads in the same way to the recurrence:

(i) $d = 1$: We know that $\mathbb{Z}^1 = F_1$, and that the walk in $\mathbb{Z}^1$ is recurrent.

(ii) $d \geq 2$: define the event $A_n = \{S_n = e\}$. Then $A_\infty = \limsup_n A_n$ is the subset of $\Omega$, for which the walk returns to $e$ infinitely many times. Since for $d \geq 2$,

$$ E[B] = \sum_{n=0}^\infty P[A_n] = m(1) < \infty , $$

the Borel-Cantelli lemma gives $P[A_\infty] = 0$ for $d \geq 2$. The particle returns therefore to $e$ only finitely many times and similarly it visits each vertex in $F_d$ only finitely many times. This means that the particle eventually leaves every bounded set and escapes to infinity. □

Remark. We could say that the problem of the random walk on a discrete group $G$ is solvable if one can give an algebraic formula for the function $m(x)$. We have seen that the classes of Abelian finitely generated and free groups are solvable. Trying to extend the class of solvable random walks seems to be an interesting problem. It would also be interesting to know whether there exists a group such that the function $m(x)$ is transcendental.

3.11 The free Laplacian on a discrete group 

Definition. Let $G$ be a countable discrete group and $A \subset G$ a finite set which generates $G$. The Cayley graph $\Gamma$ of $(G, A)$ is the graph with sites $G$ and with edges the pairs $(i, j)$ satisfying $i - j \in A$ or $j - i \in A$.

Remark. We write the composition in $G$ additively even though we do not assume that $G$ is Abelian. We allow $A$ to contain also the identity $e \in G$. In this case, the Cayley graph contains two closed loops of length 1 at each site.

Definition. The symmetric random walk on $\Gamma(G, A)$ is the process obtained by summing up independent uniformly distributed $(A \cup A^{-1})$-valued random variables $X_n$. More generally, we can allow the random variables $X_n$ to be independent but have any distribution on $A \cup A^{-1}$. This distribution is given by numbers $p_a = p_{a^{-1}} \in [0,1]$ satisfying $\sum_{a \in A \cup A^{-1}} p_a = 1$.

Definition. The free Laplacian for the random walk given by $(G, A, p)$ is the linear operator on $l^2(G)$ defined by

$$ L_{gh} = p_{g-h} . $$

Since we assumed $p_a = p_{a^{-1}}$, the matrix $L$ is symmetric: $L_{gh} = L_{hg}$ and the spectrum

$$ \sigma(L) = \{ E \in \mathbb{C} \mid (L - E) \ \text{is not invertible} \} $$

is a compact subset of the real line.
Remark. One can interpret $L$ as the transition probability matrix of the random walk which is a "Markov chain". We will come back to this interpretation later.

Example. $G = \mathbb{Z}$, $A = \{1\}$, $p = p_a = 1/2$ for $a = 1, -1$ and $p_a = 0$ for $a \notin \{1, -1\}$. The matrix

$$ L = \begin{pmatrix} \ddots & p & & & & \\ p & 0 & p & & & \\ & p & 0 & p & & \\ & & p & 0 & p & \\ & & & p & 0 & p \\ & & & & p & \ddots \end{pmatrix} $$

is also called a Jacobi matrix. It acts on the Hilbert space $l^2(\mathbb{Z})$ by $(Lu)_n = p(u_{n+1} + u_{n-1})$.

Example. Let $G = D_3$ be the dihedral group which has the presentation $G = \langle a, b \mid a^3 = b^2 = (ab)^2 = 1 \rangle$. The group is the symmetry group of the equilateral triangle. It has 6 elements and it is the smallest non-Abelian group. Let us number the group elements with integers $\{1 = e, 2 = a, 3 = a^2, 4 = b, 5 = ab, 6 = a^2b\}$. We have for example $3 \star 4 = a^2b = 6$ or $3 \star 5 = a^2ab = a^3b = b = 4$. In this case $A = \{a, b\}$, $A^{-1} = \{a^{-1}, b\}$ so that $A \cup A^{-1} = \{a, a^{-1}, b\}$. The Cayley graph of the group is a graph with 6 vertices. We could take the uniform distribution $p_a = p_b = p_{a^{-1}} = 1/3$ on $A \cup A^{-1}$, but let us instead choose the distribution $p_a = p_{a^{-1}} = 1/4$, $p_b = 1/2$, which is natural if we consider multiplication by $b$ and multiplication by $b^{-1}$ as different.



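Example. The free Laplacian on $D_3$ with the random walk transition probabilities $p_a = p_{a^{-1}} = 1/4$, $p_b = 1/2$ is, with respect to the numbering $\{1 = e, 2 = a, 3 = a^2, 4 = b, 5 = ab, 6 = a^2b\}$ and $L_{gh} = p_{gh^{-1}}$, the matrix

$$ L = \begin{pmatrix} 0 & 1/4 & 1/4 & 1/2 & 0 & 0 \\ 1/4 & 0 & 1/4 & 0 & 0 & 1/2 \\ 1/4 & 1/4 & 0 & 0 & 1/2 & 0 \\ 1/2 & 0 & 0 & 0 & 1/4 & 1/4 \\ 0 & 0 & 1/2 & 1/4 & 0 & 1/4 \\ 0 & 1/2 & 0 & 1/4 & 1/4 & 0 \end{pmatrix} $$

which has the eigenvalues $1$, $0$, $1/4$, $1/4$, $-3/4$, $-3/4$.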
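The matrix of the free Laplacian on $D_3$ can be generated from the group law rather than typed in by hand. The sketch below encodes the element $a^k b^s$ as the pair `(k, s)`, uses the relation $ba = a^{-1}b$, and takes $L_{gh} = p_{gh^{-1}}$ as the multiplicative reading of $L_{gh} = p_{g-h}$; this encoding and the listed candidate eigenvectors are assumptions made for the illustration. Exact fractions are used so that the eigenvalue equations can be tested for equality.

```python
from fractions import Fraction

def multiply(g, h):
    """Product in D3 of g = a^k b^s and h = a^m b^t, using b a = a^{-1} b."""
    (k, s), (m, t) = g, h
    return ((k + (m if s == 0 else -m)) % 3, (s + t) % 2)

def inverse(g):
    k, s = g
    return ((-k) % 3, 0) if s == 0 else (k, 1)   # reflections a^k b are involutions

elements = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]   # e, a, a^2, b, ab, a^2 b
p = {(1, 0): Fraction(1, 4), (2, 0): Fraction(1, 4), (0, 1): Fraction(1, 2)}

# L_{gh} = p_{g h^{-1}}; entries not listed in p are zero
L = [[p.get(multiply(g, inverse(h)), Fraction(0)) for h in elements] for g in elements]

def apply(matrix, v):
    """Matrix-vector product with exact arithmetic."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

eigenpairs = [
    (Fraction(1), [1, 1, 1, 1, 1, 1]),
    (Fraction(0), [1, 1, 1, -1, -1, -1]),
    (Fraction(1, 4), [1, -1, 0, 1, 0, -1]),
    (Fraction(-3, 4), [1, -1, 0, -1, 0, 1]),
]
for lam, v in eigenpairs:
    print(lam, apply(L, v) == [lam * x for x in v])
```

The constant vector has eigenvalue $1$ because every row of $L$ sums to $1$, which is the statement that $L$ is a stochastic matrix.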
Figure. The Cayley graph of the dihedral group $G = D_3$ is a regular graph with 6 vertices and 9 edges.




A basic question is: what is the relation between the spectrum of L, the 
structure of the group G and the properties of the random walk on G. 

Definition. As before, let $m_n$ be the probability that the random walk starting in $e$ returns in $n$ steps to $e$ and let

$$ m(x) = \sum_{n=0}^\infty m_n x^n $$

be the generating function of the sequence $m_n$.



Proposition 3.11.1. The norm of $L$ is equal to $\limsup_{n\to\infty}(m_n)^{1/n}$, the inverse of the radius of convergence of $m(x)$.



Proof. Because $L$ is symmetric and real, it is self-adjoint and the spectrum of $L$ is a subset of the real line $\mathbb{R}$ and the spectral radius of $L$ is equal to its norm $\|L\|$.

We have $[L^n]_{ee} = m_n$ since $[L^n]_{ee}$ is the sum of products $\prod_{j=1}^n p_{a_j}$ each of which is the probability that a specific path of length $n$ starting and landing at $e$ occurs.
It remains therefore to verify that

$$ \limsup_{n\to\infty}\|L^n\|^{1/n} = \limsup_{n\to\infty}[L^n]_{ee}^{1/n} $$

and since the $\geq$ direction is trivial we have only to show the $\leq$ direction. Denote by $E(\lambda)$ the spectral projection matrix of $L$, so that $dE(\lambda)$ is a projection-valued measure on the spectrum and the spectral theorem says that $L$ can be written as $L = \int \lambda\, dE(\lambda)$. The measure $\mu_e = dE_{ee}$ is called a spectral measure of $L$. The real number $E(\lambda) - E(\mu)$ is nonzero if and only if there exists some spectrum of $L$ in the interval $[\lambda, \mu)$. Since

$$ (-1)\sum_n \frac{[L^n]_{ee}}{\lambda^{n+1}} = \int_{\mathbb{R}}(E - \lambda)^{-1}\, dk(E) $$

can't be analytic in $\lambda$ in a point $\lambda_0$ of the support of $dk$ which is the spectrum of $L$, the claim follows. □

Remark. We have seen that the matrix $L$ defines a spectral measure $\mu_e$ on the real line. It can be defined for any group element $g$, not only $g = e$, and is the same measure. It is therefore also the so called density of states of $L$. If we think of $\mu$ as playing the role of the law for random variables, then the integrated density of states $E(\lambda) = F_\mu(\lambda) = \int_{-\infty}^\lambda d\mu(\lambda)$ plays the role of the distribution function for real-valued random variables.

Example. The Fourier transform $U: l^2(\mathbb{Z}^1) \to L^2(\mathbb{T}^1)$:

$$ \hat{u}(x) = (Uu)(x) = \sum_{n \in \mathbb{Z}} u_n e^{inx} $$

diagonalises the matrix $L$ for the random walk on $\mathbb{Z}^1$:

$$ (ULU^*)\hat{u}(x) = ((UL)(u_n))(x) = pU(u_{n+1} + u_{n-1})(x) $$

$$ = p\sum_{n \in \mathbb{Z}}(u_{n+1} + u_{n-1})e^{inx} $$

$$ = p\sum_{n \in \mathbb{Z}} u_n(e^{i(n-1)x} + e^{i(n+1)x}) $$

$$ = p\sum_{n \in \mathbb{Z}} u_n(e^{ix} + e^{-ix})e^{inx} $$

$$ = p\sum_{n \in \mathbb{Z}} u_n 2\cos(x)e^{inx} $$

$$ = 2p\cos(x) \cdot \hat{u}(x) . $$

This shows that the spectrum of $ULU^*$ is $[-1,1]$ and because $U$ is a unitary transformation, also the spectrum of $L$ is $[-1,1]$.

Example. Let $G = \mathbb{Z}^d$ and $A = \{e_i\}_{i=1}^d$, where $\{e_i\}$ is the standard basis. Assume $p = p_a = 1/(2d)$. The analogous Fourier transform $F: l^2(\mathbb{Z}^d) \to L^2(\mathbb{T}^d)$ shows that $FLF^*$ is the multiplication with $\frac{1}{d}\sum_{j=1}^d \cos(x_j)$. The spectrum is again the interval $[-1,1]$.

Example. The Fourier diagonalisation works for any discrete Abelian group 
with finitely many generators. 

Example. $G = F_d$ the free group with the natural $d$ generators. The spectrum of $L$ is

$$ [-\frac{\sqrt{2d-1}}{d}, \frac{\sqrt{2d-1}}{d}] $$

which is strictly contained in $[-1,1]$ if $d > 1$.

Remark. Kesten has shown that the spectral radius of $L$ is equal to 1 if and only if the group $G$ has an invariant mean. For example, for a finite graph, where $L$ is a stochastic matrix, a matrix for which each column is a probability vector, the spectral radius is 1 because $L^T$ has the eigenvector $(1, \ldots, 1)$ with eigenvalue 1.
Random walks and Laplacians can be defined on any graph. The spectrum of the Laplacian on a finite graph is an invariant of the graph but there are non-isomorphic graphs with the same spectrum. There are known infinite self-similar graphs, for which the Laplacian has pure point spectrum [65]. There are also known infinite graphs, such that the Laplacian has purely singular continuous spectrum [99]. For more on spectral theory on graphs, start with [6].



3.12 A discrete Feynman-Kac formula 

Definition. A discrete Schrödinger operator is a bounded linear operator $L$ on the Hilbert space $l^2(\mathbb{Z}^d)$ of the form

$$ (Lu)(n) = \sum_{i=1}^d \big(u(n + e_i) - 2u(n) + u(n - e_i)\big) + V(n)u(n) , $$

where $V$ is a bounded function on $\mathbb{Z}^d$. They are discrete versions of operators $L = -\Delta + V(x)$ on $L^2(\mathbb{R}^d)$, where $\Delta$ is the free Laplacian. Such operators are also called Jacobi matrices.



Definition. The Schrödinger equation

$$ i\hbar\dot{u} = Lu, \quad u(0) = u_0 $$

is a differential equation in $l^2(\mathbb{Z}^d, \mathbb{C})$ which describes the motion of a complex valued wave function $u$ of a classical quantum mechanical system. The constant $\hbar$ is called the Planck constant and $i = \sqrt{-1}$ is the imaginary unit. Let us assume to have units where $\hbar = 1$ for simplicity.



Remark. The solution of the Schrödinger equation is

$$ u_t = e^{\frac{t}{i}L}u_0 . $$

The solution exists for all times because the von Neumann series

$$ e^{tL} = 1 + tL + \frac{t^2L^2}{2!} + \frac{t^3L^3}{3!} + \cdots $$

is in the space of bounded operators.

Remark. It is an achievement of the physicist Richard Feynman to see the evolution as a path integral. In the case of differential operators $L$, this idea can be made rigorous by going to imaginary time and one can write for $L = -\Delta + V$

$$ e^{-tL}u_0(x) = E_x[e^{-\int_0^t V(\gamma(s))\, ds}u_0(\gamma(t))] , $$

where $E_x$ is the expectation value with respect to the measure on the Wiener space of Brownian motion starting at $x$.
Here is a discrete version of the Feynman-Kac formula:

Definition. The Schrödinger equation with discrete time is defined as

$$ i(u_{t+\epsilon} - u_t) = \epsilon Lu_t , $$

where $\epsilon > 0$ is fixed. We get the evolution

$$ u_{t+n\epsilon} = (1 - i\epsilon L)^n u_t $$

and we denote the right hand side with $\tilde{L}^n u_t$.

Definition. Denote by $\Gamma_n(i,j)$ the set of paths of length $n$ from $i$ to $j$ in the graph $G$ having as sites $\mathbb{Z}^d$ and as edges the pairs $[i,j]$ with $|i - j| \leq 1$. The graph $G$ is the Cayley graph of the group $\mathbb{Z}^d$ with the generators $A \cup A^{-1} \cup \{e\}$, where $A = \{e_1, \ldots, e_d\}$ is the set of natural generators and where $e$ is the identity.

Definition. Given a path $\gamma$ of finite length $n$, we use the notation

$$ \exp(\int_\gamma L) = \prod_{i=1}^n L_{\gamma(i),\gamma(i+1)} . $$

Let $\Omega$ be the set of all paths on $G$ and let $E$ denote the expectation with respect to a measure $P$ of the random walk on $G$ starting at $0$.



Theorem 3.12.1 (Discrete Feynman-Kac formula). Given a discrete Schrödinger operator $L$. Then

$$ (L^n u)(0) = E_0[\exp(\int_0^n L)\, u(\gamma(n))] . $$



Proof.

$$ (L^n u)(0) = \sum_j (L^n)_{0j}u(j) = \sum_j \sum_{\gamma \in \Gamma_n(0,j)} \exp(\int_0^n L)\, u(j) = \sum_{\gamma \in \Gamma_n} \exp(\int_0^n L)\, u(\gamma(n)) . $$

□ 



Remark. This discrete random walk expansion corresponds to the Feynman-Kac formula in the continuum. If we extend the potential to all the sites of the Cayley graph by putting $V([k,k]) = V(k)$ and $V([k,l]) = 1$ for $k \ne l$, we can define $\exp(\int_0^n V)$ as the product $\prod_{i=1}^n V([\gamma(i),\gamma(i+1)])$. Then

$$ (L^n u)(0) = E[ \exp(\int_0^n V) \, u(\gamma(n)) ] $$

which is formally the Feynman-Kac formula.

In order to compute $(\tilde{L}^n u)(k)$ with $\tilde{L} = (1 - i\epsilon L)$, we have to take the potential $\tilde{v}$ defined by

$$ \tilde{v}([k,k]) = 1 - i\epsilon v(\gamma(k)) . $$

Remark. The Schrödinger equation with discrete time has the disadvantage that the time evolution of the quantum mechanical system is no longer unitary. This drawback could be overcome by considering also $i\hbar(u_t - u_{t-\epsilon}) = \epsilon L u_t$ so that the propagator from $u_{t-\epsilon}$ to $u_{t+\epsilon}$ is given by the unitary operator

$$ U = (1 - i\epsilon L)(1 + i\epsilon L)^{-1} $$

which is a Cayley transform of $L$. See also [51], where the idea is discussed to use $\tilde{L} = \arccos(aL)$, where $L$ has been rescaled such that $aL$ has norm smaller or equal to 1. The time evolution can then be computed by iterating the map $A: (\psi,\phi) \mapsto (2aL\psi - \phi, \psi)$ on $H \oplus H$.
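
The loss of unitarity can be checked numerically. The following is only a sketch under assumptions that are not in the text: a periodic one-dimensional lattice with 8 sites, a random potential, $\hbar = 1$, and the hypothetical helper names `discrete_schroedinger`, `euler_step` and `cayley_step`. It compares the explicit step $1 - i\epsilon L$ with the Cayley transform $(1 - i\epsilon L)(1 + i\epsilon L)^{-1}$.

```python
import numpy as np

def discrete_schroedinger(n, potential):
    """Jacobi matrix L = -Laplacian + V on a cycle with n sites (illustrative choice)."""
    shift = np.roll(np.eye(n), 1, axis=0)
    laplacian = shift + shift.T - 2.0 * np.eye(n)
    return -laplacian + np.diag(potential)

def euler_step(L, eps):
    """Explicit step u -> (1 - i eps L) u, which is not unitary in general."""
    return np.eye(len(L)) - 1j * eps * L

def cayley_step(L, eps):
    """Cayley transform (1 - i eps L)(1 + i eps L)^{-1}, unitary for symmetric L."""
    identity = np.eye(len(L))
    return (identity - 1j * eps * L) @ np.linalg.inv(identity + 1j * eps * L)

rng = np.random.default_rng(0)
L = discrete_schroedinger(8, rng.uniform(-1.0, 1.0, size=8))
u0 = np.zeros(8, dtype=complex)
u0[0] = 1.0  # wave function concentrated at one site

# Iterate both propagators 50 times with eps = 0.1 and compare the norms.
u_euler = np.linalg.matrix_power(euler_step(L, 0.1), 50) @ u0
u_cayley = np.linalg.matrix_power(cayley_step(L, 0.1), 50) @ u0
print(np.linalg.norm(u_euler), np.linalg.norm(u_cayley))
```

Each eigenmode of $L$ with eigenvalue $\lambda$ is multiplied by $1 - i\epsilon\lambda$ in the explicit scheme, which has modulus at least 1, so the norm can only grow, while the Cayley step keeps it equal to 1 up to rounding.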



3.13 Discrete Dirichlet problem 

Also for other partial differential equations, solutions can be described prob- 
abilistically. We look here at the Dirichlet problem in a bounded discrete 
region. The formula which we derive in this situation holds also in the 
continuum limit, where the random walk is replaced by Brownian motion. 

Definition. The discrete Laplacian on $\mathbb{Z}^2$ is defined as

$$ \Delta f(n,m) = f(n+1,m) + f(n-1,m) + f(n,m+1) + f(n,m-1) - 4f(n,m) . $$

With the discrete partial derivatives

$$ \delta_x^+ f(n,m) = f(n+1,m) - f(n,m), \quad \delta_x^- f(n,m) = f(n,m) - f(n-1,m) , $$

$$ \delta_y^+ f(n,m) = f(n,m+1) - f(n,m), \quad \delta_y^- f(n,m) = f(n,m) - f(n,m-1) , $$

the Laplacian is the sum of the second derivatives as in the continuous case, where $\Delta f = f_{xx} + f_{yy}$:

$$ \Delta = \delta_x^+ \delta_x^- + \delta_y^+ \delta_y^- . $$

The discrete Laplacian in $\mathbb{Z}^3$ is defined in the same way as a discretisation of $\Delta f = f_{xx} + f_{yy} + f_{zz}$. The setup is analogous in higher dimensions:

$$ (\Delta u)(n) = \frac{1}{2d} \sum_{i=1}^d (u(n + e_i) + u(n - e_i) - 2u(n)) , $$

where $e_1,\dots,e_d$ is the standard basis in $\mathbb{Z}^d$.

Definition. A bounded region $D$ in $\mathbb{Z}^d$ is a finite subset of $\mathbb{Z}^d$. Two points are connected in $D$ if they are connected in $\mathbb{Z}^d$. The boundary $\delta D$ of $D$ consists of all lattice points in $D$ which have a neighboring lattice point which is outside $D$. Given a function $f$ on the boundary $\delta D$, the discrete Dirichlet problem asks for a function $u$ on $D$ which satisfies the discrete Laplace equation $\Delta u = 0$ in the interior $\mathrm{int}(D)$ and for which $u = f$ on the boundary $\delta D$.



Figure. The discrete Dirichlet problem is a problem in linear algebra. One algorithm to solve the problem can be restated as a probabilistic "path integral method". To find the value of $u$ at a point $x$, look at the "discrete Wiener space" of all paths $\gamma$ starting at $x$ and ending at some boundary point $S_T(\gamma) \in \delta D$ of $D$. The solution is $u(x) = E_x[f(S_T)]$.




Definition. Let $\Omega_{x,n}$ denote the set of all paths of length $n$ in $D$ which start at a point $x \in D$ and end up at a point in the boundary $\delta D$. Let us call it the discrete Wiener space of order $n$ defined by $x$ and $D$. It is a subset of $\Gamma_{x,n}$, the set of all paths of length $n$ in $\mathbb{Z}^d$ starting at $x$, which has $(2d)^n$ elements. We take the uniform distribution on this finite set so that $P_{x,n}[\{\gamma\}] = 1/(2d)^n$.

Definition. Let $L$ be the matrix for which $L_{x,y} = 1/(2d)$ if $x,y \in \mathbb{Z}^d$ are connected by a path of length 1 and $x$ is in the interior of $D$, and $L_{x,y} = 0$ otherwise. The matrix $L$ is a bounded linear operator on $l^2(D)$ and satisfies $L_{x,z} = L_{z,x}$ for $x,z \in \mathrm{int}(D) = D \setminus \delta D$. Given $f: \delta D \to \mathbb{R}$, we extend $f$ to a function $F$ on $D$ with $F(x) = 0$ on $\mathrm{int}(D) = D \setminus \delta D$ and $F(x) = f(x)$ for $x \in \delta D$. The discrete Dirichlet problem can be restated as the problem to find the solution $u$ to the system of linear equations

$$ (1 - L)u = f . $$



Lemma 3.13.1. The number of paths in $\Omega_{x,n}$ starting at $x \in D$ and ending at a different point $y \in D$ is equal to $(2d)^n L^n_{xy}$.

Proof. Use induction. By definition, $L_{xz}$ is $1/(2d)$ if there is a path of length 1 from $x$ to $z$. The integer $(2d)^n L^n_{xy}$ is the number of paths of length $n$ from $x$ to $y$. □



Figure. Here is an example of a problem where $D \subset \mathbb{Z}^2$ has 10 points. Only the rows corresponding to interior points are nonzero.

Definition. For a function $f$ on the boundary $\delta D$, define

$$ E_{x,n}[f] = \sum_{y \in \delta D} f(y) L^n_{xy} $$

and

$$ E_x[f] = \sum_{n=0}^\infty E_{x,n}[f] . $$

This functional defines for every point $x \in D$ a probability measure $\mu_x$ on the boundary $\delta D$. It is the discrete analog of the harmonic measure in the continuum. The measure $P_x$ on the set of paths satisfies $E_x[1] = 1$ as we will just see.



Proposition 3.13.2. Let $S_n$ be the random walk on $\mathbb{Z}^d$ and let $T$ be the stopping time which is the first exit time of $S$ from $D$. The solution to the discrete Dirichlet problem is

$$ u(x) = E_x[f(S_T)] . $$



Proof. Because $(1 - L)u = f$ and

$$ E_{x,n}[f] = (L^n f)_x , $$

we have from the geometric series formula

$$ (1 - A)^{-1} = \sum_{k=0}^\infty A^k $$

the result

$$ u(x) = (1 - L)^{-1} f(x) = \sum_{n=0}^\infty [L^n f]_x = \sum_{n=0}^\infty E_{x,n}[f] = E_x[f] . $$

Define the matrix $K$ by $K_{jj} = 1$ for $j \in \delta D$ and $K_{ij} = L_{ji}/4$ else. The matrix $K$ is a stochastic matrix: its column vectors are probability vectors. The matrix $K$ has a maximal eigenvalue 1 and so norm 1 ($K^T$ has the maximal eigenvector $(1,1,\dots,1)$ with eigenvalue 1, and the eigenvalues of $K$ agree with the eigenvalues of $K^T$). Because $\|L\| < 1$, the spectral radius of $L$ is smaller than 1 and the series converges. If $f = 1$ on the boundary, then $u = 1$ everywhere. From $E_x[1] = 1$ follows that the discrete Wiener measure is a probability measure on the set of all paths starting at $x$. □
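
The three descriptions of the solution (linear system, geometric series, expectation over stopped random walks) can be compared numerically. The sketch below makes assumptions that are not in the text: the region is the square $D = \{0,\dots,4\}^2$, the boundary data is the hypothetical choice $f(x,y) = x$, and the walk is stopped when it first reaches a boundary point of $D$. All function names are illustrative.

```python
import numpy as np

N = 5  # D = {0,...,4}^2, an illustrative region
points = [(i, j) for i in range(N) for j in range(N)]
index = {p: k for k, p in enumerate(points)}
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def is_boundary(p):
    """A point of D is a boundary point if some lattice neighbor lies outside D."""
    return any((p[0] + dx, p[1] + dy) not in index for dx, dy in steps)

def boundary_data(p):
    """Hypothetical boundary values f(x, y) = x."""
    return float(p[0])

# L_{xy} = 1/(2d) = 1/4 if x is interior and y is a neighbor of x; boundary rows are zero.
L = np.zeros((len(points), len(points)))
for p in points:
    if not is_boundary(p):
        for dx, dy in steps:
            L[index[p], index[(p[0] + dx, p[1] + dy)]] = 0.25

# F agrees with f on the boundary and vanishes in the interior.
F = np.array([boundary_data(p) if is_boundary(p) else 0.0 for p in points])

# Direct solution of the linear system (1 - L) u = F.
u_solve = np.linalg.solve(np.eye(len(points)) - L, F)

# Geometric series sum_n L^n F, truncated after 500 terms.
u_series, term = np.zeros_like(F), F.copy()
for _ in range(500):
    u_series += term
    term = L @ term

def walk_estimate(start, samples, rng):
    """Monte Carlo average of f(S_T) over random walks stopped at the boundary."""
    total = 0.0
    for _ in range(samples):
        p = start
        while not is_boundary(p):
            dx, dy = steps[rng.integers(4)]
            p = (p[0] + dx, p[1] + dy)
        total += boundary_data(p)
    return total / samples

u_walk = walk_estimate((2, 2), 5000, np.random.default_rng(1))
print(u_solve[index[(2, 2)]], u_series[index[(2, 2)]], u_walk)
```

Since $u(x,y) = x$ is itself discrete harmonic, the exact solution in this example is $u(x,y) = x$; the linear solve and the series reproduce it, and the random walk average approximates it up to sampling error.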




Figure. The random walk defines a diffusion process.

Figure. The diffusion process after time t = 2.

Figure. The diffusion process after time t = 3.



The path integral result can be generalized and the increased generality 
makes it even simpler to describe: 

Definition. Let $(D,E)$ be an arbitrary finite directed graph, where $D$ is a finite set of $n$ vertices and $E \subset D \times D$ is the set of edges. Denote an edge connecting $i$ with $j$ with $e_{ij}$. Let $K$ be a stochastic matrix on $l^2(D)$: the entries satisfy $K_{ij} \ge 0$ and its column vectors are probability vectors: $\sum_{i \in D} K_{ij} = 1$ for all $j \in D$. The stochastic matrix encodes the graph and additionally defines a random walk on $D$ if $K_{ij}$ is interpreted as the transition probability to hop from $j$ to $i$. Let us call a point $j \in D$ a boundary point if $K_{jj} = 1$, and denote the set of boundary points by $\delta D$. The complement $\mathrm{int}(D) = D \setminus \delta D$ consists of interior points. Define the matrix $L$ as $L_{jj} = 0$ if $j$ is a boundary point and $L_{ij} = K_{ji}$ otherwise.

The discrete Wiener space $\Omega_x$ on $D$ is the set of all finite paths $\gamma = (x = x_0, x_1, x_2, \dots, x_n)$ starting at a point $x \in D$ for which $K_{x_i x_{i+1}} > 0$. The discrete Wiener measure on this countable set is defined as $P_x[\{\gamma\}] = \prod_{j=0}^{n-1} K_{x_j x_{j+1}}$. A function $u$ on $D$ is called harmonic if $(Lu)_x = 0$ for all $x \in D$. The discrete Dirichlet problem on the graph is to find a function $u$ on $D$ which is harmonic and which satisfies $u = f$ on the boundary $\delta D$ of $D$.



Theorem 3.13.3 (The Dirichlet problem on graphs). Assume $D$ is a directed graph. If $S_n$ is the random walk starting at $x$ and $T$ is the stopping time to reach the boundary of $D$, then the solution

$$ u = E_x[f(S_T)] $$

is the expected value of $f(S_T)$ on the discrete Wiener space of all paths starting at $x$ and ending at the boundary of $D$.



Proof. Let $F$ be the function on $D$ which agrees with $f$ on the boundary of $D$ and which is 0 in the interior of $D$. The Dirichlet problem on the graph is the system of linear equations $(1 - L)u = f$. Because the matrix $L$ has spectral radius smaller than 1, the solution is given by the geometric series

$$ u = \sum_{n=0}^\infty L^n f . $$

But this is the sum $E_x[f(S_T)]$ over all paths $\gamma$ starting at $x$ and ending at the boundary of $D$. □

Example. Let us look at a directed graph $(D,E)$ with 5 vertices and 2 boundary points. The Laplacian on $D$ is defined by a stochastic matrix $K$. Given a function $f$ on the boundary of $D$, the solution $u$ of the discrete Dirichlet problem $(1 - L)u = f$ on this graph can be written as a path integral $\sum_{n=0}^\infty L^n f = E_x[f(S_T)]$ for the random walk $S_n$ on $D$ stopped at the boundary $\delta D$.
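
A small computation makes the example concrete. The stochastic matrix of the example is not available here, so the matrix `K` below is a hypothetical stand-in with 5 vertices, of which the last two are boundary points; the function name `dirichlet_on_graph` and the boundary data are likewise assumptions. The sketch builds $L$ from $K$ as in the definition and compares the direct solution of $(1 - L)u = F$ with the truncated geometric series.

```python
import numpy as np

# Hypothetical column-stochastic matrix: K[i, j] is the probability to hop from j to i.
# Vertices 3 and 4 satisfy K[j, j] = 1 and are therefore boundary points.
K = np.array([
    [0.0, 1/3, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 1/3, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1/3, 0.5, 0.0, 1.0],
])

def dirichlet_on_graph(K, f, terms=200):
    """Return the direct and the series solution of (1 - L) u = F on the graph."""
    boundary = np.isclose(np.diag(K), 1.0)
    L = K.T.copy()
    L[boundary, :] = 0.0                 # rows of boundary points vanish
    F = np.where(boundary, f, 0.0)       # f on the boundary, 0 in the interior
    u_direct = np.linalg.solve(np.eye(len(K)) - L, F)
    u_series, term = np.zeros(len(K)), F.copy()
    for _ in range(terms):               # truncated geometric series sum_n L^n F
        u_series += term
        term = L @ term
    return u_direct, u_series

# Boundary data: 0 at vertex 3 and 1 at vertex 4, so u is the probability to exit at vertex 4.
f = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
u_direct, u_series = dirichlet_on_graph(K, f)
print(u_direct)
```

For this particular matrix the interior values can be verified by hand from $u_0 = u_1/2$, $u_1 = (u_0 + u_2 + 1)/3$ and $u_2 = (u_1 + 1)/2$, which give $u = (3/8, 3/4, 7/8)$ on the interior vertices.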



Figure. The directed graph 
(D,E) with 5 vertices and 2 
boundary points. 




Remark. The interplay of random walks on graphs and discrete partial 
differential equations is relevant in electric networks. For mathematical 
treatments, see [19, 103]. 

3.14 Markov processes 

Definition. Given a measurable space $(S,\mathcal{B})$ called state space, where $S$ is a set and $\mathcal{B}$ is a $\sigma$-algebra on $S$. A function $P: S \times \mathcal{B} \to \mathbb{R}$ is called a transition probability function if $P(x,\cdot)$ is a probability measure on $(S,\mathcal{B})$ for all $x \in S$ and if for every $B \in \mathcal{B}$, the map $s \mapsto P(s,B)$ is $\mathcal{B}$-measurable. Define $P^1(x,B) = P(x,B)$ and inductively the measures $P^{n+1}(x,B) = \int_S P^n(y,B) \, P(x,dy)$, where we write $\int P(x,dy)$ for the integration on $S$ with respect to the measure $P(x,\cdot)$.

Example. Let $S$ be a finite set and let $\mathcal{B}$ be the set of all subsets of $S$. Given a stochastic matrix $K$ and a point $s \in S$, the measures $P(s,\cdot)$ are the probability vectors which are the columns of $K$.

A set of nodes with connections is a graph. Any network can be described by a graph. The link structure of the web forms a graph, where the individual websites are the nodes and there is an arrow from site $a_i$ to site $a_j$ if $a_i$ links to $a_j$. The adjacency matrix $A$ of this graph is called the web graph. If there are $n$ sites, then the adjacency matrix is an $n \times n$ matrix with entries $A_{ij} = 1$ if there exists a link from $a_j$ to $a_i$. If we divide each column by the number of 1s in that column, we obtain a Markov matrix $A$ which is called the normalized web matrix. Define the matrix $E$ which satisfies $E_{ij} = 1/n$ for all $i,j$. The graduate students and later entrepreneurs Sergey Brin and Lawrence Page had in 1996 the following "one billion dollar idea":

Definition. A Google matrix is the matrix $G = dA + (1 - d)E$, where $0 < d < 1$ is a parameter called damping factor and $A$ is the stochastic matrix obtained from the adjacency matrix of a graph by scaling the columns to become probability vectors. This is a stochastic $n \times n$ matrix with eigenvalue 1. The corresponding eigenvector $v$ scaled so that the largest value is 10 is called page rank of the damping factor $d$.

Page rank is probably the world's largest matrix computation. In 2006, one 
had n=8.1 billion. [57] 
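
For a toy web the page rank can be computed by power iteration. The sketch below is an illustration under assumptions: the four-site link structure, the value $d = 0.85$ and the function names `google_matrix` and `page_rank` are made up for the example, and every site is assumed to have at least one outgoing link so that the column normalization is defined.

```python
import numpy as np

def google_matrix(adjacency, d=0.85):
    """G = d A + (1 - d) E with A the column-normalized adjacency matrix."""
    A = adjacency / adjacency.sum(axis=0)      # divide each column by its number of 1s
    n = A.shape[0]
    return d * A + (1.0 - d) * np.full((n, n), 1.0 / n)

def page_rank(G, iterations=200):
    """Eigenvector of G for the eigenvalue 1, scaled so that the largest entry is 10."""
    v = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(iterations):                # power iteration
        v = G @ v
    return 10.0 * v / v.max()

# Hypothetical web with 4 sites; adjacency[i, j] = 1 if site j links to site i.
# Links: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
adjacency = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])

G = google_matrix(adjacency)
rank = page_rank(G)
print(np.round(rank, 3))
```

Site 2 receives links from all other sites and obtains the largest rank, while site 3, which nobody links to, obtains the smallest. For realistic sizes one would use sparse matrices instead of the dense array used here.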

Remark. The transition probability functions are elements in $C(S, M_1(S))$, where $M_1(S)$ is the set of Borel probability measures on $S$. With the multiplication

$$ (P \circ Q)(x,B) = \int_S P(y,B) \, Q(x,dy) $$

we get a commutative semi-group. The relation $P^{n+m} = P^n \circ P^m$ is also called the Chapman-Kolmogorov equation.

Definition. Given a probability space $(\Omega,\mathcal{A},P)$ with a filtration $\mathcal{A}_n$ of $\sigma$-algebras. An $\mathcal{A}_n$-adapted process $X_n$ with values in $S$ is called a discrete time Markov process if there exists a transition probability function $P$ such that

$$ P[X_n \in B \mid \mathcal{A}_k](\omega) = P^{n-k}(X_k(\omega), B) . $$

Definition. If the state space $S$ is a discrete space, a finite or countable set, then the Markov process is called a Markov chain. A Markov chain is called a denumerable Markov chain if the state space $S$ is countable, and a finite Markov chain if the state space is finite.

Remark. It follows from the definition of a Markov process that $X_n$ satisfies the elementary Markov property: for $n > k$,

$$ P[X_n \in B \mid X_1, \dots, X_k] = P[X_n \in B \mid X_k] . $$

This means that the probability distribution of $X_n$ is determined by knowing the probability distribution of $X_{n-1}$. The future depends only on the present and not on the past.



Theorem 3.14.1 (Markov processes exist). For any state space (S,B) and 
any transition probability function P, there exists a corresponding Markov 
process X. 



Proof. Choose a probability measure $\mu$ on $(S,\mathcal{B})$ and define on the product space $(\Omega,\mathcal{A}) = (S^{\mathbb{N}},\mathcal{B}^{\mathbb{N}})$ the $\pi$-system $\mathcal{C}$ consisting of cylinder sets $\prod_{n \in \mathbb{N}} B_n$ given by a sequence $B_n \in \mathcal{B}$ such that $B_n = S$ except for finitely many $n$. Define a measure $P = P_\mu$ on $(\Omega,\mathcal{C})$ by requiring

$$ P[\omega_k \in B_k, k = 1,\dots,n] = \int_S \mu(dx_0) \int_{B_1} P(x_0,dx_1) \cdots \int_{B_n} P(x_{n-1},dx_n) . $$

This measure has a unique extension to the $\sigma$-algebra $\mathcal{A}$.
Define the increasing sequence of $\sigma$-algebras $\mathcal{A}_n = \mathcal{B}^n \times \prod_{i=1}^\infty \{\emptyset,\Omega\}$ containing cylinder sets. The random variables $X_n(\omega) = x_n$ are $\mathcal{A}_n$-adapted. In order to see that it is a Markov process, we have to check that

$$ P[X_n \in B_n \mid \mathcal{A}_{n-1}](\omega) = P(X_{n-1}(\omega), B_n) , $$

which is a special case of the above requirement by taking $B_k = S$ for $k \ne n$. □

Example. Independent $S$-valued random variables.

Assume the measures $P(x,\cdot)$ are independent of $x$. Call this measure $P$. In this case

$$ P[X_n \in B_n \mid \mathcal{A}_{n-1}](\omega) = P[B_n] , $$

which means that $P[X_n \in B_n \mid \mathcal{A}_{n-1}] = P[X_n \in B_n]$. The $S$-valued random variables $X_n$ are independent and have identical distribution and $P$ is the law of $X_n$. Every sequence of IID random variables is a Markov process.

Example. Countable and finite state Markov chains.

Given a Markov process with finite or countable state space $S$. We define the transition matrix $P_{ij}$ on the Hilbert space $l^2(S)$ by

$$ P_{ij} = P(i,\{j\}) . $$

The matrix $P$ transports the law of $X_n$ into the law of $X_{n+1}$. The transition matrix is a stochastic matrix: each column is a probability vector: $\sum_j P_{ij} = 1$ with $P_{ij} \ge 0$. Every measure on $S$ can be given by a vector $\pi \in l^2(S)$ and $P\pi$ is again a measure. If $X_0$ is constant and equal to $i$ and $X_n$ is a Markov process with transition probability $P$, then $P^n_{ij} = P[X_n = j]$.
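
The identity $P^n_{ij} = P[X_n = j]$ and the Chapman-Kolmogorov relation can be illustrated on a small chain. The three-state matrix below is a hypothetical example, stored with the convention `P[i, j]` $= P(i,\{j\})$, so that in this array layout each row sums to 1; the helper names `n_step` and `simulate` are assumptions.

```python
import numpy as np

# Hypothetical transition matrix on S = {0, 1, 2}: P[i, j] = P(i, {j}).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

def n_step(P, n):
    """n-step transition probabilities, the matrix power P^n."""
    return np.linalg.matrix_power(P, n)

def simulate(P, start, n, samples, rng):
    """Empirical law of X_n for the chain started at X_0 = start."""
    counts = np.zeros(P.shape[0])
    for _ in range(samples):
        state = start
        for _ in range(n):
            state = rng.choice(P.shape[0], p=P[state])  # one step with law P(state, .)
        counts[state] += 1
    return counts / samples

exact = n_step(P, 5)[0]                                  # law of X_5 given X_0 = 0
empirical = simulate(P, 0, 5, 4000, np.random.default_rng(5))
print(np.round(exact, 3), np.round(empirical, 3))
```

The empirical frequencies agree with the row of $P^5$ up to sampling error, and $P^5 = P^2 P^3$ is the Chapman-Kolmogorov equation in matrix form.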

Example. Sum of independent $S$-valued random variables. Let $S$ be a countable Abelian group and let $\pi$ be a probability distribution on $S$ assigning to each $j \in S$ the weight $\pi_j$. Define $P_{ij} = \pi_{j-i}$. Now $X_n$ is the sum of $n$ independent random variables with law $\pi$. The sum changes from $i$ to $j$ with probability $P_{ij} = \pi_{j-i}$.

Example. Branching processes. Given $S = \{0,1,2,\dots\} = \mathbb{N}$ with fixed probability distribution $\pi$. If $X$ is an $S$-valued random variable with distribution $\pi$, then $\sum_{k=1}^n X_k$, where the $X_k$ are independent copies of $X$, has a distribution which we denote by $\pi^{(n)}$. Define the matrix $P_{ij} = \pi^{(i)}_j$. The Markov chain with this transition probability matrix on $S$ is called a branching process.

Definition. The transition probability function $P$ acts also on measures $\pi$ of $S$ by

$$ \mathcal{P}(\pi)(B) = \int_S P(x,B) \, d\pi(x) . $$

A probability measure $\pi$ is called invariant if $\mathcal{P}\pi = \pi$. An invariant measure $\pi$ on $S$ is called stationary measure of the Markov process.

This operator on measures leaves a subclass of measures with densities with respect to some measure $\nu$ invariant. We can thus assign a Markov operator to a transition probability function:

Lemma 3.14.2. For any $x \in S$, the measure

$$ \nu(B) = \sum_{n=0}^\infty \frac{1}{2^n} P^n(x,B) $$

on $(S,\mathcal{B})$ has the property that if $\mu$ is absolutely continuous with respect to $\nu$, then also $\mathcal{P}\mu$ is absolutely continuous with respect to $\nu$.



Proof. Given $\mu = f \cdot \nu$ with $f \in L^1(S)$. Let us assume that $f \ge 0$ because in general we can write $f = f^+ - f^-$, where $f^\pm$ are both nonnegative. If we show that $\mathcal{P}\mu^\pm$ with $\mu^\pm = f^\pm \nu$ are both absolutely continuous, then also $\mathcal{P}\mu = \mathcal{P}\mu^+ - \mathcal{P}\mu^-$ is absolutely continuous.
Now,

$$ \mathcal{P}\mu(B) = \int_S P(x,B) f(x) \, d\nu(x) $$

is absolutely continuous with respect to $\nu$ because $\mathcal{P}\mu(B) = 0$ implies $P(x,B) = 0$ for almost all $x$ with $f(x) > 0$ and therefore $f(x)P^n(x,B) = 0$ for all $n$ and so $f(x)\nu(B) = 0$ implying $\nu(B) = 0$. □



Corollary 3.14.3. To each transition probability function can be assigned a Markov operator $\mathcal{P}: L^1(S,\nu) \to L^1(S,\nu)$.

Proof. Choose $\nu$ as above and define

$$ \mathcal{P}f_1 = f_2 $$

if $\mathcal{P}\mu_1 = \mu_2$ with $\mu_i = f_i\nu$. To check that $\mathcal{P}$ is a Markov operator, we have to check $\mathcal{P}f \ge 0$ if $f \ge 0$, which follows from

$$ \mathcal{P}f\nu(B) = \int_S P(x,B) f(x) \, d\nu(x) \ge 0 . $$

We also have to show that $\|\mathcal{P}f\|_1 = 1$ if $\|f\|_1 = 1$. It is enough to show this for elementary functions $f = \sum_j a_j 1_{B_j}$ with $a_j > 0$ and $B_j \in \mathcal{B}$ satisfying $\sum_j a_j \nu(B_j) = 1$, and for these it suffices that $\|\mathcal{P}(1_B\nu)\| = \nu(B)$. But this is obvious: $\|\mathcal{P}(1_B\nu)\| = \int_B P(x,S) \, d\nu(x) = \nu(B)$. □

We see that the abstract approach to study Markov operators on $L^1(S)$ is more general than looking at transition probability measures. This point of view can reduce some of the complexity when dealing with discrete time Markov processes.

Chapter 4

Continuous Stochastic Processes

4.1 Brownian motion 

Definition. Let $(\Omega,\mathcal{A},P)$ be a probability space and let $T \subset \mathbb{R}$ be time. A collection of random variables $X_t$, $t \in T$, with values in $\mathbb{R}$ is called a stochastic process. If $X_t$ takes values in $S = \mathbb{R}^d$, it is called a vector-valued stochastic process but one often abbreviates this by the name stochastic process too. If the time $T$ is a discrete subset of $\mathbb{R}$, then $X_t$ is called a discrete time stochastic process. If time is an interval, $\mathbb{R}^+$ or $\mathbb{R}$, it is called a stochastic process with continuous time. For any fixed $\omega \in \Omega$, one can regard $X_t(\omega)$ as a function of $t$. It is called a sample function of the stochastic process. In the case of a vector-valued process, it is a sample path, a curve in $\mathbb{R}^d$.



Definition. A stochastic process is called measurable if $X: T \times \Omega \to S$ is measurable with respect to the product $\sigma$-algebra $\mathcal{B}(T) \times \mathcal{A}$. In the case of a real-valued process ($S = \mathbb{R}$), one says $X$ is continuous in probability if for any $t \in \mathbb{R}$ the limit $X_{t+h} \to X_t$ takes place in probability for $h \to 0$. If the sample function $X_t(\omega)$ is a continuous function of $t$ for almost all $\omega$, then $X_t$ is called a continuous stochastic process. If the sample function is a right continuous function in $t$ for almost all $\omega \in \Omega$, $X_t$ is called a right continuous stochastic process. Two stochastic processes $X_t$ and $Y_t$ satisfying $P[X_t - Y_t = 0] = 1$ for all $t \in T$ are called modifications of each other or indistinguishable. This means that for almost all $\omega \in \Omega$, the sample functions coincide: $X_t(\omega) = Y_t(\omega)$.

Definition. An $\mathbb{R}^n$-valued random vector $X$ is called Gaussian if it has the multidimensional characteristic function

$$ \phi_X(s) = E[e^{i s \cdot X}] = e^{-(s,Vs)/2 + i(m,s)} $$

for some nonsingular symmetric $n \times n$ matrix $V$ and vector $m = E[X]$. The matrix $V$ is called covariance matrix and the vector $m$ is called the mean vector.

Example. A normally distributed random variable $X$ is a Gaussian random variable. The covariance matrix is in this case the scalar $\mathrm{Var}[X]$.

Example. If $V$ is a symmetric positive definite matrix, so that $\det(V) \ne 0$, then the random vector $X$ on $\Omega = \mathbb{R}^n$ with density

$$ f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(V)}} e^{-(x-m, V^{-1}(x-m))/2} $$

is a Gaussian random vector with covariance matrix $V$. To see that it has the required multidimensional characteristic function $\phi_X(u)$, note that because $V$ is symmetric, one can diagonalize it. Therefore, the computation can be done in a basis where $V$ is diagonal. This reduces the situation to characteristic functions for normal random variables.

Example. A set of random variables $X_1,\dots,X_n$ is called jointly Gaussian if any linear combination $\sum_{i=1}^n a_i X_i$ is a Gaussian random variable too. For a jointly Gaussian set of random variables $X_j$, the vector $X = (X_1,\dots,X_n)$ is a Gaussian random vector.

Example. A Gaussian process is an $\mathbb{R}^d$-valued stochastic process with continuous time such that $(X_{t_0}, X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian for any $t_0 < t_1 < \cdots < t_n$. It is called centered if $m_t = E[X_t] = 0$ for all $t$.

Definition. An $\mathbb{R}^d$-valued continuous Gaussian process $X_t$ with mean vector $m_t = E[X_t]$ and the covariance matrix $V(s,t) = \mathrm{Cov}[X_s,X_t] = E[(X_s - m_s)(X_t - m_t)^*]$ is called Brownian motion if for any $0 \le t_0 < t_1 < \cdots < t_n$, the random vectors $X_{t_0}, X_{t_{i+1}} - X_{t_i}$ are independent and the covariance matrix $V$ satisfies $V(s,t) = V(r,r)$, where $r = \min(s,t)$ and $s \mapsto V(s,s)$. It is called the standard Brownian motion if $m_t = 0$ for all $t$ and $V(s,t) = \min\{s,t\}$.
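
The covariance $V(s,t) = \min\{s,t\}$ of standard Brownian motion can be observed in a simulation. The following sketch samples one-dimensional paths at a few times by summing independent Gaussian increments; the chosen times, the sample size and the function name `brownian_paths` are assumptions made for the illustration.

```python
import numpy as np

def brownian_paths(times, samples, rng):
    """Sample standard Brownian motion at increasing times via independent increments."""
    times = np.asarray(times, dtype=float)
    dt = np.diff(np.concatenate(([0.0], times)))      # lengths of the time steps
    increments = rng.normal(0.0, np.sqrt(dt), size=(samples, len(times)))
    return np.cumsum(increments, axis=1)              # B_{t_1}, ..., B_{t_n} in each row

rng = np.random.default_rng(2)
times = [0.25, 0.5, 1.0, 2.0]
paths = brownian_paths(times, 200_000, rng)

empirical = paths.T @ paths / paths.shape[0]          # estimates E[B_s B_t]
expected = np.minimum.outer(times, times)             # min(s, t)
print(np.round(empirical, 2))
```

The empirical matrix agrees with $\min(s,t)$ up to sampling error of order $10^{-2}$.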



Figure. A path $X_t(\omega_1)$ of Brownian motion in the plane $S = \mathbb{R}^2$ with a drift $m_t = E[X_t] = (t,0)$. This is not standard Brownian motion. The process $Y_t = X_t - (t,0)$ is standard Brownian motion.



Recall that for two random vectors $X, Y$ with mean vectors $m, n$, the covariance matrix is $\mathrm{Cov}[X,Y]_{ij} = E[(X_i - m_i)(Y_j - n_j)]$. We say $\mathrm{Cov}[X,Y] = 0$ if this matrix is the zero matrix.



Lemma 4.1.1. A Gaussian random vector $(X,Y)$ with random vectors $X, Y$ satisfying $\mathrm{Cov}[X,Y] = 0$ has the property that $X$ and $Y$ are independent.



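Proof. We can assume without loss of generality that the random variables $X, Y$ are centered. Two $\mathbb{R}^n$-valued Gaussian random vectors $X$ and $Y$ are independent if and only if

$$ \phi_{(X,Y)}(s,t) = \phi_X(s) \cdot \phi_Y(t), \quad \forall s,t \in \mathbb{R}^n . $$

Indeed, if $V$ is the covariance matrix of the random vector $X$ and $W$ is the covariance matrix of the random vector $Y$, then

$$ U = \begin{pmatrix} V & \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[Y,X] & W \end{pmatrix} = \begin{pmatrix} V & 0 \\ 0 & W \end{pmatrix} $$

is the covariance matrix of the random vector $(X,Y)$. With $r = (s,t)$, we have therefore

$$ \phi_{(X,Y)}(r) = E[e^{i r \cdot (X,Y)}] = e^{-\frac{1}{2}(r \cdot Ur)} = e^{-\frac{1}{2}(s \cdot Vs) - \frac{1}{2}(t \cdot Wt)} = e^{-\frac{1}{2}(s \cdot Vs)} e^{-\frac{1}{2}(t \cdot Wt)} = \phi_X(s)\phi_Y(t) . $$

□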

Example. In the context of this lemma, one should mention that there exist uncorrelated normally distributed random variables $X, Y$ which are not independent [114]. Let $X$ be Gaussian on $\mathbb{R}$ and define for $a > 0$ the variable $Y(\omega) = -X(\omega)$ if $|X(\omega)| > a$ and $Y = X$ else. Also $Y$ is Gaussian and there exists $a$ such that $E[XY] = 0$. But $X$ and $Y$ are not independent, and $X + Y = 0$ on $\{|X| > a\}$ shows that $X + Y$ is not Gaussian. This example shows why Gaussian vectors $(X,Y)$ are defined directly as $\mathbb{R}^2$-valued random variables with some properties and not as a vector $(X,Y)$ where each of the two components is a one-dimensional Gaussian random variable.



Proposition 4.1.2. If $X_t$ is a Gaussian process with covariance $V(s,t) = V(r,r)$ with $r = \min(s,t)$, then it is Brownian motion.

Proof. By the above lemma (4.1.1), we only have to check that for all $i < j$

$$ \mathrm{Cov}[X_{t_0}, X_{t_{j+1}} - X_{t_j}] = 0, \quad \mathrm{Cov}[X_{t_{i+1}} - X_{t_i}, X_{t_{j+1}} - X_{t_j}] = 0 . $$

But by assumption

$$ \mathrm{Cov}[X_{t_0}, X_{t_{j+1}} - X_{t_j}] = V(t_0,t_{j+1}) - V(t_0,t_j) = V(t_0,t_0) - V(t_0,t_0) = 0 $$

and

$$ \mathrm{Cov}[X_{t_{i+1}} - X_{t_i}, X_{t_{j+1}} - X_{t_j}] = V(t_{i+1},t_{j+1}) - V(t_{i+1},t_j) - V(t_i,t_{j+1}) + V(t_i,t_j) $$

$$ = V(t_{i+1},t_{i+1}) - V(t_{i+1},t_{i+1}) - V(t_i,t_i) + V(t_i,t_i) = 0 . $$

□

Remark. Botanist Robert Brown was studying the fertilization process in a 
species of flowers in 1828. While watching pollen particles in water through 
a microscope, he observed small particles in "rapid oscillatory motion". 
While previous studies concluded that these particles were alive, Brown's 
explanation was that matter is composed of small "active molecules", which
exhibit a rapid, irregular motion having its origin in the particles themselves 
and not in the surrounding fluid. Brown's contribution was to establish 
Brownian motion as an important phenomenon, to demonstrate its presence 
in inorganic as well as organic matter and to refute by experiment incorrect 
mechanical or biological explanations of the phenomenon. The book [75] 
includes more on the history of Brownian motion. 

The construction of Brownian motion happens in two steps: one first con- 
structs a Gaussian process which has the desired properties and then shows 
that it has a modification which is continuous. 



Proposition 4.1.3. Given a separable real Hilbert space $(H, \|\cdot\|)$. There exists a probability space $(\Omega,\mathcal{A},P)$ and a family $X(h)$, $h \in H$, of real-valued random variables on $\Omega$ such that $h \mapsto X(h)$ is linear, and $X(h)$ is Gaussian, centered and $E[X(h)^2] = \|h\|^2$.



Proof. Pick an orthonormal basis $\{e_n\}$ in $H$ and attach to each $e_n$ a centered Gaussian IID random variable $X_n \in \mathcal{L}^2$ satisfying $\|X_n\|_2 = 1$. Given a general $h = \sum_n h_n e_n \in H$, define

$$ X(h) = \sum_n h_n X_n $$

which converges in $\mathcal{L}^2$. Because $X_n$ are independent, they are orthonormal in $\mathcal{L}^2$ so that

$$ \|X(h)\|_2^2 = \sum_n h_n^2 \|X_n\|^2 = \sum_n h_n^2 = \|h\|_2^2 . $$

□

Definition. If we choose $H = L^2(\mathbb{R}^+, dx)$, the map $X: H \to \mathcal{L}^2$ is also called a Gaussian measure. For a Borel set $A \subset \mathbb{R}^+$ we define then $X(A) = X(1_A)$. The term "measure" is warranted by the fact that $X(A) = \sum_n X(A_n)$ if $A$ is a countable disjoint union of Borel sets $A_n$. One also has $X(\emptyset) = 0$.

Remark. The space $X(H) \subset \mathcal{L}^2$ is a Hilbert space isomorphic to $H$ and in particular

$$ E[X(h)X(h')] = (h,h') . $$

We know from the above lemma that $h$ and $h'$ are orthogonal if and only if $X(h)$ and $X(h')$ are independent and that

$$ E[X(A)X(B)] = \mathrm{Cov}[X(A),X(B)] = (1_A, 1_B) = |A \cap B| . $$

Especially $X(A)$ and $X(B)$ are independent if and only if $A$ and $B$ are disjoint.

Definition. Define the process $B_t = X([0,t])$. For any sequence $t_1, t_2, \dots \in T$, this process has independent increments $B_{t_i} - B_{t_{i-1}}$ and is a Gaussian process. For each $t$, we have $E[B_t^2] = t$ and for $s < t$, the increment $B_t - B_s$ has variance $t - s$ so that

$$ E[B_s B_t] = E[B_s^2] + E[B_s(B_t - B_s)] = E[B_s^2] = s . $$

This model of Brownian motion has everything except continuity.



Theorem 4.1.4 (Kolmogorov's lemma). Given a stochastic process $X_t$ with $t \in [a,b]$ for which there exist three constants $p > r, K$ such that

$$ E[|X_{t+h} - X_t|^p] \le K \cdot h^{1+r} $$

for every $t, t+h \in [a,b]$, then $X_t$ has a modification $Y_t$ which is almost everywhere continuous: for all $s,t \in [a,b]$

$$ |Y_t(\omega) - Y_s(\omega)| \le C(\omega) |t-s|^\alpha, \quad 0 < \alpha < \frac{r}{p} . $$



Proof. We can assume without loss of generality that $a = 0, b = 1$ because we can translate and rescale the time variable to be in this situation. Define $\epsilon = r - \alpha p$. By the Chebychev-Markov inequality (2.5.4)

$$ P[|X_{t+h} - X_t| \ge |h|^\alpha] \le |h|^{-\alpha p} E[|X_{t+h} - X_t|^p] \le K|h|^{1+\epsilon} $$

so that

$$ P[|X_{(k+1)/2^n} - X_{k/2^n}| \ge 2^{-n\alpha}] \le K 2^{-n(1+\epsilon)} . $$

Therefore

$$ \sum_{n=1}^\infty \sum_{k=0}^{2^n-1} P[|X_{(k+1)/2^n} - X_{k/2^n}| \ge 2^{-n\alpha}] < \infty . $$

By the first Borel-Cantelli lemma (2.2.2), there exists $n(\omega) < \infty$ almost everywhere such that for all $n \ge n(\omega)$ and $k = 0,\dots,2^n - 1$

$$ |X_{(k+1)/2^n}(\omega) - X_{k/2^n}(\omega)| < 2^{-n\alpha} . $$

Let $n \ge n(\omega)$ and $t \in [k/2^n, (k+1)/2^n]$ of the form $t = k/2^n + \sum_{i=1}^m \gamma_i/2^{n+i}$ with $\gamma_i \in \{0,1\}$. Then

$$ |X_t(\omega) - X_{k2^{-n}}(\omega)| \le \sum_{i=1}^m \gamma_i 2^{-\alpha(n+i)} \le d \, 2^{-n\alpha} $$

with $d = (1 - 2^{-\alpha})^{-1}$. Similarly

$$ |X_t - X_{(k+1)2^{-n}}| \le d \, 2^{-n\alpha} . $$

Given $t, t+h \in D = \{k2^{-n} \mid n \in \mathbb{N}, k = 0,\dots,2^n - 1\}$. Take $n$ so that $2^{-n-1} \le h < 2^{-n}$ and $k$ so that $k/2^{n+1} \le t < (k+1)/2^{n+1}$. Then $(k+1)/2^{n+1} \le t + h \le (k+3)/2^{n+1}$ and

$$ |X_{t+h} - X_t| \le 2d \, 2^{-(n+1)\alpha} \le 2d h^\alpha . $$

For almost all $\omega$, this holds for sufficiently small $h$.

We know now that for almost all $\omega$, the path $X_t(\omega)$ is uniformly continuous on the dense set of dyadic numbers $D = \{k/2^n\}$. Such a function can be extended to a continuous function on $[0,1]$ by defining

$$ Y_t(\omega) = \lim_{s \in D \to t} X_s(\omega) . $$

Because the inequality in the assumption of the theorem implies $E[X_t(\omega) - \lim_{s \in D \to t} X_s(\omega)] = 0$ and by Fatou's lemma $E[Y_t(\omega) - \lim_{s \in D \to t} X_s(\omega)] = 0$, we know that $X_t = Y_t$ almost everywhere. The process $Y$ is therefore a modification of $X$. Moreover, $Y$ satisfies

$$ |Y_t(\omega) - Y_s(\omega)| \le C(\omega) |t-s|^\alpha $$

for all $s,t \in [a,b]$. □



Corollary 4.1.5. Brownian motion exists. 



Proof. In one dimension, take the process $B_t$ from above. Since $X_h = B_{t+h} - B_t$ is centered with variance $h$, the fourth moment is $E[X_h^4] = \frac{d^4}{dx^4} \exp(-x^2h/2)|_{x=0} = 3h^2$, so that

$$ E[(B_{t+h} - B_t)^4] = 3h^2 . $$

Kolmogorov's lemma (4.1.4) assures the existence of a continuous modification of $B$.

To define standard Brownian motion in $n$ dimensions, we take the joint motion $B_t = (B_t^{(1)},\dots,B_t^{(n)})$ of $n$ independent one-dimensional Brownian motions. □

Definition. Let $B_t$ be the standard Brownian motion. For any $x \in \mathbb{R}^n$, the process $X_t^x = x + B_t$ is called Brownian motion started at $x$.

The first rigorous construction of Brownian motion was given by Norbert Wiener in 1923. By construction of a Wiener measure on $C[0,1]$, one has a construction of Brownian motion, where the probability space is directly given by the set of paths. One has then the process $X_t(\omega) = \omega(t)$. We will come to this later. A general construction of such measures is possible given a Markov transition probability function [108]. The construction given here is due to Neveu and goes back to Kakutani. It can be found in Simon's book on functional integration [97] or in the book of Revuz and Yor [86] about continuous martingales and Brownian motion. This construction has the advantage that it can be applied to more general situations.

In McKean's book "Stochastic integrals" [68] one can find Lévy's direct proof of the existence of Brownian motion. Because that proof gives an explicit formula for the Brownian motion process $B_t$ and is so constructive, we outline it shortly:

1) Take as a basis in $L^2([0,1])$ the Haar functions

$$ f_{k,n} = 2^{(n-1)/2} (1_{[(k-1)2^{-n}, k2^{-n})} - 1_{[k2^{-n}, (k+1)2^{-n})}) $$

for $\{(k,n) \mid n \ge 1, k < 2^n, k \text{ odd}\}$ and $f_{0,0} = 1$.

2) Take a family $X_{k,n}$ for $(k,n) \in I = \{(k,n) \mid n \ge 1, k < 2^n, k \text{ odd}\} \cup \{(0,0)\}$ of independent Gaussian random variables.

3) Define

$$ B_t = \sum_{(k,n) \in I} X_{k,n} \int_0^t f_{k,n}(s) \, ds . $$

4) Prove convergence of the above series.

5) Check

$$ E[B_s B_t] = \sum_{(k,n) \in I} \int_0^s f_{k,n} \int_0^t f_{k,n} = (1_{[0,s]}, 1_{[0,t]}) = \min\{s,t\} . $$

6) Extend the definition from $t \in [0,1]$ to $t \in [0,\infty)$ by taking independent Brownian motions $B^{(n)}$ and defining $B_t = \sum_{n < [t]} B^{(n)}_1 + B^{([t])}_{t - [t]}$, where $[t]$ is the largest integer smaller or equal to $t$.
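
The outline can be tried out numerically on $[0,1]$. The sketch below is a truncated version under stated assumptions: the Haar functions are taken in the standard normalization $f_{k,n} = 2^{(n-1)/2}(1_{[(k-1)2^{-n},k2^{-n})} - 1_{[k2^{-n},(k+1)2^{-n})})$ with $k$ odd, the Gaussian variables are standard normal, the series is cut off after a finite number of levels, and the names `schauder`, `index_set`, `levy_path` and `truncated_covariance` are hypothetical. The integrals $\int_0^t f_{k,n}$ are tent functions, which the code evaluates in closed form.

```python
import numpy as np

def schauder(k, n, t):
    """Integral from 0 to t of the Haar function f_{k,n}: a tent centered at k / 2^n."""
    if (k, n) == (0, 0):
        return np.asarray(t, dtype=float)          # integral of f_{0,0} = 1
    center, width = k / 2.0**n, 1.0 / 2.0**n
    tent = np.clip(1.0 - np.abs(np.asarray(t) - center) / width, 0.0, None)
    return 2.0 ** (-(n + 1) / 2.0) * tent          # peak height 2^{(n-1)/2} * 2^{-n}

def index_set(levels):
    """Pairs (k, n) with 1 <= n <= levels, k odd, k < 2^n, together with (0, 0)."""
    return [(0, 0)] + [(k, n) for n in range(1, levels + 1) for k in range(1, 2**n, 2)]

def levy_path(t, levels, rng):
    """Truncated series B_t = sum of X_{k,n} times the integral of f_{k,n} from 0 to t."""
    return sum(rng.normal() * schauder(k, n, t) for k, n in index_set(levels))

def truncated_covariance(s, t, levels):
    """Truncated sum over (k, n) of the products of the integrals up to s and up to t."""
    return sum(schauder(k, n, s) * schauder(k, n, t) for k, n in index_set(levels))

t = np.linspace(0.0, 1.0, 2**6 + 1)
path = levy_path(t, levels=6, rng=np.random.default_rng(3))
print(path[0], truncated_covariance(0.25, 0.75, levels=6))
```

At dyadic times of level at most 6, the indicator functions $1_{[0,s]}$ lie in the span of the Haar functions used, so the truncated covariance sum already equals $\min\{s,t\}$ there; between dyadic points the truncated path is the linear interpolation of the limiting one.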

4.2 Some properties of Brownian motion 

We first want to establish that Brownian motion is unique. To do so, we 
first have to say, when two processes are the same: 

Definition. Two processes $X_t$ on $(\Omega,\mathcal{A},P)$ and $X_t'$ on $(\Omega',\mathcal{A}',P')$ are called indistinguishable if there exists an isomorphism $U: \Omega \to \Omega'$ of probability spaces such that $X_t'(U\omega) = X_t(\omega)$. Indistinguishable processes are considered the same. A special case is if the two processes are defined on the same probability space $(\Omega,\mathcal{A},P)$ and $X_t(\omega) = Y_t(\omega)$ for almost all $\omega$.



Proposition 4.2.1. Brownian motion is unique in the sense that two stan- 
dard Brownian motions are indistinguishable. 



Proof. The construction of the map $H \to \mathcal{L}^2$ was unique in the sense that if we construct two different processes $X(h)$ and $Y(h)$, then there exists an isomorphism $U$ of the probability space such that $X(h) = Y(U(h))$. The continuity of $X_t$ and $Y_t$ implies then that for almost all $\omega$, $X_t(\omega) = Y_t(U\omega)$. In other words, they are indistinguishable. □

We are now ready to list some symmetries of Brownian motion. 



Theorem 4.2.2 (Properties of Brownian motion). The following symmetries exist:

(i) Time-homogeneity: For any $s > 0$, the process $\tilde{B}_t = B_{t+s} - B_s$ is a Brownian motion independent of $\sigma(B_u, u \le s)$.

(ii) Reflection symmetry: The process $\tilde{B}_t = -B_t$ is a Brownian motion.

(iii) Brownian scaling: For every $c > 0$, the process $\tilde{B}_t = cB_{t/c^2}$ is a Brownian motion.

(iv) Time inversion: The process $\tilde{B}_0 = 0$, $\tilde{B}_t = tB_{1/t}$, $t > 0$ is a Brownian motion.



Proof. (i),(ii),(iii) In each case, $\tilde{B}_t$ is a centered Gaussian process
with continuous paths, independent increments and variance $t$.

(iv) $\tilde{B}$ is a centered Gaussian process with covariance

$$ Cov[\tilde{B}_s, \tilde{B}_t] = E[\tilde{B}_s \tilde{B}_t] = st \cdot E[B_{1/s} B_{1/t}] = st \cdot \inf(1/s, 1/t) = \inf(s,t) . $$

Continuity of $\tilde{B}_t$ is obvious for $t > 0$. We have to check continuity only for
$t = 0$, but since $E[\tilde{B}_s^2] = s \to 0$ for $s \to 0$, we know that
$\tilde{B}_s \to 0$ almost everywhere. □
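
The covariance computations behind the scaling property (iii) and the time inversion (iv) can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of the text. It only uses the fact that standard Brownian motion has covariance min(s, t).

```python
def cov_bm(s, t):
    """Covariance E[B_s B_t] = min(s, t) of standard Brownian motion."""
    return min(s, t)

def cov_scaled(s, t, c):
    """Covariance of the scaled process c * B_{t/c^2}."""
    return c * c * cov_bm(s / c**2, t / c**2)

def cov_inverted(s, t):
    """Covariance of the time-inverted process t * B_{1/t} for s, t > 0."""
    return s * t * cov_bm(1.0 / s, 1.0 / t)

times = [0.1, 0.5, 1.0, 2.0, 7.5]
max_err = 0.0
for s in times:
    for t in times:
        for c in (0.5, 2.0, 3.0):
            max_err = max(max_err, abs(cov_scaled(s, t, c) - cov_bm(s, t)))
        max_err = max(max_err, abs(cov_inverted(s, t) - cov_bm(s, t)))
print("largest deviation from min(s,t):", max_err)
```

Since a centered Gaussian process is determined in law by its covariance, agreement with min(s, t) is exactly what the proof needs, apart from the continuity of the paths.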

The strong law of large numbers for Brownian motion follows:



Theorem 4.2.3 (SLLN for Brownian motion). If $B_t$ is Brownian motion,
then

$$ \lim_{t \to \infty} \frac{1}{t} B_t = 0 $$

almost surely.



Proof. From the time inversion property (iv), we see that $t^{-1} B_t = \tilde{B}_{1/t}$,
which converges for $t \to \infty$ to $0$ almost everywhere, because of the almost
everywhere continuity of $\tilde{B}_t$ at $t = 0$. □

Definition. A parameterized curve $t \in [0,\infty) \mapsto X_t \in \mathbb{R}^n$ is called Hölder
continuous of order $\alpha$ if there exists a constant $C$ such that

$$ \|X_{t+h} - X_t\| \le C \cdot h^\alpha $$

for all $h > 0$ and all $t$. A curve which is Hölder continuous of order $\alpha = 1$
is called Lipschitz continuous.

The curve is called locally Hölder continuous of order $\alpha$ if there exists for
each $t$ a constant $C = C(t)$ such that

$$ \|X_{t+h} - X_t\| \le C \cdot h^\alpha $$

for all small enough $h$. For an $\mathbb{R}^d$-valued stochastic process, (local) Hölder
continuity holds if for almost all $\omega \in \Omega$ the sample path $t \mapsto X_t(\omega)$
is (locally) Hölder continuous.



Proposition 4.2.4. For every a < 1/2, Brownian motion has a modification 
which is locally Holder continuous of order a. 
Proof. It is enough to show it in one dimension because a vector function
with locally Hölder continuous component functions is locally Hölder
continuous. Since increments of Brownian motion are Gaussian, we have

$$ E[(B_t - B_s)^{2p}] = C_p \cdot |t-s|^p $$

for some constant $C_p$. Kolmogorov's lemma assures the existence of a
modification satisfying locally

$$ |B_t - B_s| \le C |t-s|^\alpha , \quad 0 < \alpha < \frac{p-1}{2p} . $$

Because $p$ can be chosen arbitrarily large, the result follows. □

Because of this proposition, we can assume from now on that all the paths
of Brownian motion are locally Hölder continuous of order $\alpha < 1/2$.
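
To make the role of $p$ concrete, here is a small sketch under the standard fact that a centered Gaussian variable with variance $|t-s|$ has $2p$-th moment $(2p-1)!! \cdot |t-s|^p$, so that one may take $C_p = (2p-1)!!$. The function names are illustrative.

```python
def double_factorial_odd(p):
    """(2p-1)!! = 1*3*5*...*(2p-1), the 2p-th moment of a standard normal."""
    result = 1
    for k in range(1, 2 * p, 2):
        result *= k
    return result

def holder_bound(p):
    """Upper limit (p-1)/(2p) for the Holder exponent obtained from the 2p-th moment."""
    return (p - 1) / (2 * p)

for p in (2, 3, 5):
    print("p =", p, " C_p =", double_factorial_odd(p))
for p in (2, 5, 50, 500):
    print("p =", p, " exponent bound =", holder_bound(p))
```

The bound $(p-1)/(2p)$ increases to $1/2$ as $p$ grows but never reaches it, which is why the proposition is stated for every $\alpha < 1/2$ and not for $\alpha = 1/2$.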

Definition. A continuous path $X_t = (X_t^{(1)}, \dots, X_t^{(n)})$ is called nowhere
differentiable, if for all $t$, each coordinate function $X^{(i)}$ is not differentiable
at $t$.



Theorem 4.2.5 (Wiener). Brownian motion is nowhere differentiable: for
almost all $\omega$, the path $t \mapsto B_t(\omega)$ is nowhere differentiable.



Proof. We follow [68]. It is enough to show it in one dimension. Suppose
$B_t$ is differentiable at some point $0 \le s < 1$. There exists then an integer $l$
such that $|B_t - B_s| \le l(t-s)$ for $t - s > 0$ small enough. But this means
that

$$ |B_{j/n} - B_{(j-1)/n}| \le 7 \frac{l}{n} $$

for all $j$ satisfying

$$ i = [ns] + 1 \le j \le [ns] + 4 = i + 3 $$

and sufficiently large $n$ so that the set of differentiable paths is included in
the set

$$ B = \bigcup_{l \ge 1} \bigcup_{m \ge 1} \bigcap_{n \ge m} \bigcup_{0 < i \le n+1} \bigcap_{i < j \le i+3} \{ |B_{j/n} - B_{(j-1)/n}| < 7 \frac{l}{n} \} . $$

Using Brownian scaling and the independence of the three increments, we show
that $P[B] = 0$ as follows:

$$ P[ \bigcap_{n \ge m} \bigcup_{0 < i \le n+1} \bigcap_{i < j \le i+3} \{ |B_{j/n} - B_{(j-1)/n}| < 7 \frac{l}{n} \} ] $$
$$ \le \liminf_{n \to \infty} (n+1) \cdot P[ |B_{1/n}| < 7 \frac{l}{n} ]^3 $$
$$ = \liminf_{n \to \infty} (n+1) \cdot P[ |B_1| < 7 \frac{l}{\sqrt{n}} ]^3 $$
$$ \le \lim_{n \to \infty} \frac{C}{\sqrt{n}} = 0 . $$

□



Remark. This theorem shows in particular that we have no Lipschitz
continuity of Brownian paths. A slight generalization shows that Brownian
motion is not Hölder continuous for any $\alpha > 1/2$. One has just to do the
same trick with $k$ instead of $3$ steps, where $k(\alpha - 1/2) > 1$. The actual
modulus of continuity is very near to $\alpha = 1/2$: $|B_t - B_{t+\epsilon}|$ is of the order

$$ h(\epsilon) = \sqrt{2 \epsilon \log(1/\epsilon)} . $$

More precisely, $P[\limsup_{\epsilon \to 0} \sup_{|s-t| \le \epsilon} \frac{|B_s - B_t|}{h(\epsilon)} = 1] = 1$, as we will see
later in theorem (4.4.2).

The covariance of standard Brownian motion was given by $E[B_s B_t] = \min\{s,t\}$.
We constructed it by implementing the Hilbert space $L^2([0,\infty))$
as a Gaussian subspace of $\mathcal{L}^2(\Omega, \mathcal{A}, P)$. We look now at a more general class
of Gaussian processes.

Definition. A function $V : T \times T \to \mathbb{R}$ is called positive semidefinite,
if for all finite sets $\{t_1, \dots, t_d\} \subset T$, the matrix $V_{ij} = V(t_i, t_j)$ satisfies
$(u, Vu) \ge 0$ for all vectors $u = (u_1, \dots, u_d)$.



Proposition 4.2.6. The covariance of a centered Gaussian process is positive 
semidefinite. Any positive semidefinite function V on TxT is the covariance 
of a centered Gaussian process X t . 



Proof. The first statement follows from the fact that for all $u = (u_1, \dots, u_n)$

$$ \sum_{i,j} V(t_i, t_j) u_i u_j = E[ (\sum_{i=1}^n u_i X_{t_i})^2 ] \ge 0 . $$

We introduce for $t \in T$ a formal symbol $\delta_t$. Consider the vector space of
finite sums $\sum_{i=1}^n a_i \delta_{t_i}$ with inner product

$$ ( \sum_{i=1}^d a_i \delta_{t_i} , \sum_{j=1}^d b_j \delta_{t_j} ) = \sum_{i,j} a_i b_j V(t_i, t_j) . $$

This is a positive semidefinite inner product. Factoring out the null vectors
$\{ \|v\| = 0 \}$ and doing a completion gives a separable Hilbert space
$H$. Define now as in the construction of Brownian motion the process
$X_t = X(\delta_t)$. Because the map $X : H \to \mathcal{L}^2$ preserves the inner product, we
have

$$ E[X_t X_s] = (\delta_s, \delta_t) = V(s,t) . $$

□
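
For the Brownian covariance $V(s,t) = \min(s,t)$, the positivity of the quadratic form can be seen directly by rewriting $\sum_i u_i B_{t_i}$ in terms of independent increments. The sketch below, with illustrative function names and randomly chosen test data, compares the two expressions numerically.

```python
import random

def quad_form_min(times, u):
    """Return sum_{i,j} min(t_i, t_j) u_i u_j."""
    return sum(min(ti, tj) * ui * uj
               for ti, ui in zip(times, u)
               for tj, uj in zip(times, u))

def sum_of_squares(times, u):
    """Variance of sum_i u_i B_{t_i}, written with independent increments.

    For sorted times t_1 < ... < t_n (and t_0 = 0) one has
    sum_i u_i B_{t_i} = sum_k (u_k + ... + u_n) (B_{t_k} - B_{t_{k-1}}),
    so the variance is sum_k (t_k - t_{k-1}) (u_k + ... + u_n)^2 >= 0.
    """
    total, prev = 0.0, 0.0
    for k, tk in enumerate(times):
        tail = sum(u[k:])
        total += (tk - prev) * tail * tail
        prev = tk
    return total

rng = random.Random(0)
times = sorted(rng.uniform(0.0, 10.0) for _ in range(6))
u = [rng.uniform(-1.0, 1.0) for _ in range(6)]
q1, q2 = quad_form_min(times, u), sum_of_squares(times, u)
print(q1, q2)
```

Both numbers agree up to rounding, and the second expression is visibly a sum of nonnegative terms.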

Let us look at some examples of Gaussian processes:

Example. The Ornstein-Uhlenbeck oscillator process $X_t$ is a one-dimensional
process which is used to describe the quantum mechanical oscillator as we
will see later. Let $T = \mathbb{R}_+$ and take the function $V(s,t) = \frac{1}{2} e^{-|t-s|}$ on
$T \times T$. We first show that $V$ is positive semidefinite: The Fourier transform
of $f(t) = e^{-|t|}$ is

$$ \int_{\mathbb{R}} e^{ikt} e^{-|t|} dt = \frac{2}{k^2+1} . $$

By Fourier inversion, we get

$$ \frac{1}{2\pi} \int_{\mathbb{R}} (k^2+1)^{-1} e^{ik(t-s)} dk = \frac{1}{2} e^{-|t-s|} , $$

and so

$$ 0 \le (2\pi)^{-1} \int_{\mathbb{R}} (k^2+1)^{-1} | \sum_{j=1}^n u_j e^{ikt_j} |^2 dk = \frac{1}{2} \sum_{j,l=1}^n u_j u_l e^{-|t_j - t_l|} . $$

This process has a continuous modification because

$$ E[(X_t - X_s)^2] = (e^{-|t-t|} + e^{-|s-s|} - 2e^{-|t-s|})/2 = (1 - e^{-|t-s|}) \le |t-s| $$

and, the increments being Gaussian, $E[(X_t - X_s)^4] = 3 E[(X_t - X_s)^2]^2 \le 3 |t-s|^2$,
so that Kolmogorov's criterion applies. The Ornstein-Uhlenbeck process is also
called the oscillator process.



Proposition 4.2.7. Brownian motion $B_t$ and the Ornstein-Uhlenbeck process
$O_t$ are for $t \ge 0$ related by

$$ O_t = \frac{1}{\sqrt{2}} e^{-t} B_{e^{2t}} . $$



Proof. Denote by $O$ the Ornstein-Uhlenbeck process and let

$$ X_t = 2^{-1/2} e^{-t} B_{e^{2t}} . $$

We want to show that $X = O$. Both $X$ and $O$ are continuous centered Gaussian
processes. Since a centered Gaussian process is determined by its covariance,
we have to show that they have the same covariance. This is a
computation:

$$ E[X_t X_s] = \frac{1}{2} e^{-t} e^{-s} \min\{e^{2t}, e^{2s}\} = e^{-|s-t|}/2 = E[O_t O_s] . $$

□
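
The covariance identity used in this proof is easy to confirm numerically. The following sketch (function names are illustrative) evaluates both sides for a few pairs of times, assuming only $E[B_a B_b] = \min(a,b)$ and the Ornstein-Uhlenbeck covariance $V(s,t) = e^{-|t-s|}/2$ from the example above.

```python
import math

def cov_time_changed_bm(s, t):
    """E[X_s X_t] for X_t = 2^{-1/2} e^{-t} B_{e^{2t}}, using E[B_a B_b] = min(a, b)."""
    return 0.5 * math.exp(-s) * math.exp(-t) * min(math.exp(2 * s), math.exp(2 * t))

def cov_ou(s, t):
    """Ornstein-Uhlenbeck covariance V(s, t) = e^{-|t-s|} / 2."""
    return 0.5 * math.exp(-abs(t - s))

pairs = [(0.0, 0.0), (0.3, 1.7), (2.0, 0.5), (4.0, 4.0)]
for s, t in pairs:
    print(s, t, cov_time_changed_bm(s, t), cov_ou(s, t))
```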



It follows from this relation that the paths of the Ornstein-Uhlenbeck process
are also almost surely nowhere differentiable. There are also generalized
Ornstein-Uhlenbeck processes. The case
$V(s,t) = \int_{\mathbb{R}} e^{-ik(t-s)} d\mu(k) = \hat{\mu}(t-s)$
with the Cauchy measure $\mu = \frac{1}{2\pi(k^2+1)} dk$ on $\mathbb{R}$ can be generalized to take
any symmetric measure $\mu$ on $\mathbb{R}$ and let $\hat{\mu}$ denote its Fourier transform
$\int_{\mathbb{R}} e^{-ikt} d\mu(k)$. The same calculation as above shows that the function
$V(s,t) = \hat{\mu}(t-s)$ is positive semidefinite.



Figure. Three paths of the 
Ornstein-Uhlenbeck process. 




Example. Brownian bridge is a one-dimensional process with time $T = [0,1]$
and $V(s,t) = s(1-t)$ for $0 \le s \le t \le 1$ and $V(s,t) = V(t,s)$ else. It
is also called tied down process.

In order to show that $V$ is positive semidefinite, one observes that
$X_s = B_s - s B_1$ is a Gaussian process, which has for $s \le t$ the covariance

$$ E[X_s X_t] = E[(B_s - s B_1)(B_t - t B_1)] = s + st - 2st = s(1-t) . $$

Since $E[X_1^2] = 0$, we have $X_1 = 0$ which means that all paths start from
$0$ at time $0$ and end at $0$ at time $1$.

The realization $X_s = B_s - s B_1$ shows also that $X_t$ has a continuous
realization.
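
The realization of the bridge as $B_t - t B_1$ translates directly into a simulation. The following is a minimal sketch, assuming a Brownian path approximated by a Gaussian random walk on a grid of $[0,1]$; the function names, grid size and sample size are illustrative choices.

```python
import random

def brownian_path(n_steps, rng):
    """Values B_{k/n}, k = 0..n, of a Brownian path sampled on a grid of [0, 1]."""
    dt = 1.0 / n_steps
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, dt ** 0.5))
    return path

def bridge_from_path(path):
    """Apply X_t = B_t - t * B_1 on the grid."""
    n = len(path) - 1
    return [b - (k / n) * path[-1] for k, b in enumerate(path)]

rng = random.Random(1)
n_steps, n_paths = 32, 4000
mid_values, end_values = [], []
for _ in range(n_paths):
    x = bridge_from_path(brownian_path(n_steps, rng))
    mid_values.append(x[n_steps // 2])
    end_values.append(x[-1])
var_mid = sum(v * v for v in mid_values) / n_paths
print("sample variance at t = 1/2:", var_mid, "(theory: V(1/2,1/2) = 1/4)")
```

Every simulated path returns to 0 at time 1 by construction, and the sample variance at the midpoint should be close to $s(1-t) = 1/4$ for $s = t = 1/2$.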
Figure. Three paths of Brownian
bridge.




Let $X_t$ be the Brownian bridge and let $y$ be a point in $\mathbb{R}^d$. We can consider
the Gaussian process $Y_t = ty + X_t$ which describes paths going from $0$ at
time $0$ to $y$ at time $1$. The process $Y$ however no longer has zero mean.
Brownian motion $B$ and Brownian bridge $X$ are related to each other by
the formulas:

$$ B_t = \tilde{B}_t := (t+1) X_{t/(t+1)} , \quad X_t = \tilde{X}_t := (1-t) B_{t/(1-t)} . $$

These identities follow from the fact that both are continuous centered
Gaussian processes with the right covariance:

$$ E[\tilde{B}_s \tilde{B}_t] = (t+1)(s+1) ( \min\{\frac{s}{s+1}, \frac{t}{t+1}\} - \frac{st}{(s+1)(t+1)} ) = \min\{s,t\} = E[B_s B_t] , $$
$$ E[\tilde{X}_s \tilde{X}_t] = (1-t)(1-s) \min\{\frac{s}{1-s}, \frac{t}{1-t}\} = s(1-t) = E[X_s X_t] \quad (s \le t) $$

and uniqueness of Brownian motion.



Example. If $V(s,t) = 1_{\{s=t\}}$, we get a Gaussian process which has the
property that $X_s$ and $X_t$ are independent, if $s \ne t$. Especially, there is no
autocorrelation between different $X_s$ and $X_t$. This process is called white
noise or "great disorder". It can not be modified so that $(t,\omega) \mapsto X_t(\omega)$ is
measurable: if $(t,\omega) \mapsto X_t(\omega)$ were measurable, then $Y_t = \int_0^t X_s ds$ would
be measurable too. But then

$$ E[Y_t^2] = E[ (\int_0^t X_s ds)^2 ] = \int_0^t \int_0^t E[X_s X_{s'}] ds' ds = 0 $$

which implies $Y_t = 0$ almost everywhere so that the measure $d\mu(s) = X_s(\omega) ds$
is zero for almost all $\omega$. This gives the contradiction

$$ t = E[ \int_0^t X_s^2 ds ] = E[ \int_0^t X_s X_s ds ] = E[ \int_0^t X_s d\mu(s) ] = 0 . $$

In a distributional sense, one can see Brownian motion as a solution of
the stochastic differential equation and white noise as a generalized mean-
square derivative of Brownian motion. We will look at stochastic differential
equations later.



Example. Brownian sheet is not a stochastic process with one dimensional
time but a random field: time $T = \mathbb{R}_+^2$ is two dimensional. Actually, as
long as we deal only with Gaussian random variables and do not want to
tackle regularity questions, the time $T$ can be quite arbitrary and
proposition (4.2.6) stated earlier in this section holds true. The Gaussian
process with

$$ V((s_1,s_2),(t_1,t_2)) = \min(s_1,t_1) \cdot \min(s_2,t_2) $$

is called Brownian sheet. It has similar scaling properties as Brownian
motion.



Figure. Illustrating a sample of a Brownian sheet $B_{t,s}$. Time is two
dimensional. Every trace $t \mapsto B_{t,s_0}$ or $s \mapsto B_{t_0,s}$ is, up to a
scaling factor, standard Brownian motion.


4.3 The Wiener measure 

Let $(E, \mathcal{E})$ be a measurable space and let $T$ be a set called "time". A
stochastic process $X$ on a probability space $(\Omega, \mathcal{A}, P)$ indexed by $T$ and with
values in $E$ defines a map

$$ \phi : \Omega \to E^T , \quad \omega \mapsto X_t(\omega) . $$

The product space $E^T$ is equipped with the product $\sigma$-algebra $\mathcal{E}^T$, which
is the smallest $\sigma$-algebra for which all the coordinate functions $x \mapsto x_t$ are
measurable, which is the $\sigma$-algebra generated by the $\pi$-system

$$ \{ \prod_{i=1}^n A_{t_i} = \{ x \in E^T , x_{t_i} \in A_{t_i} \} \ | \ t_1, \dots, t_n \in T, A_{t_i} \in \mathcal{E} \} $$

consisting of cylinder sets. Denote by $Y_t(w) = w(t)$ the coordinate maps on
$E^T$. Because $Y_t \circ \phi$ is measurable for all $t$, also $\phi$ is measurable. Denote by
$P_X$ the push-forward measure of $\phi$ from $(\Omega, \mathcal{A}, P)$ to $(E^T, \mathcal{E}^T)$ defined by
$P_X[A] = P[\phi^{-1}(A)]$. For any finite set $(t_1, \dots, t_n) \subset T$ and all sets $A_i \in \mathcal{E}$,
we have

$$ P[X_{t_i} \in A_i, i = 1, \dots, n] = P_X[Y_{t_i} \in A_i, i = 1, \dots, n] . $$

One says, the two processes $X$ and $Y$ are versions of each other.
Definition. $Y$ is called the coordinate process of $X$ and the probability
measure $P_X$ is called the law of $X$.

Definition. Two processes $X, X'$ possibly defined on different probability
spaces are called versions of each other if they have the same law $P_X = P_{X'}$.

One usually does not work with the coordinate process but prefers to work
with processes which have some continuity properties. Many processes have
versions which are right continuous and have left hand limits at every point.

Definition. Let $D$ be a measurable subset of $E^T$ and assume the process
$X$ has a version $\tilde{X}$ such that almost all paths $\tilde{X}(\omega)$ are in $D$. Define the
probability space $(D, \mathcal{E}^T \cap D, Q)$, where $Q$ is the measure $Q = \phi^* P$ and where
$\phi : \Omega \to D$ has the property that $\phi(\omega)$ is the version of $\omega$ in $D$. Obviously,
the process $Y$ defined on $(D, \mathcal{E}^T \cap D, Q)$ is another version of $X$. If $D$ is
the set of paths which are right continuous with left hand limits, the process
is called the canonical version of $X$.



Corollary 4.3.1. Let $E = \mathbb{R}^d$ and $T = \mathbb{R}_+$. There exists a unique probability
measure $W$ on $C(T,E)$ for which the coordinate process $Y$ is the Brownian
motion $B$.



Proof. Let $D = C(T,E) \subset E^T$. Define the measure $W = \phi^* P$ and let
$Y$ be the coordinate process of $B$. Uniqueness: assume we have two such
measures $W, W'$ and let $Y, Y'$ be the coordinate processes of $B$ on $D$ with
respect to $W$ and $W'$. Since both $Y$ and $Y'$ are versions of $B$ and "being
a version" is an equivalence relation, they are also versions of each other.
This means that $W$ and $W'$ coincide on a $\pi$-system and are therefore the
same. □

Definition. If $E = \mathbb{R}^d$ and $T = [0,\infty)$, the measure $W$ on $C(T,E)$ is called
the Wiener measure. The probability space $(C(T,E), \mathcal{E}^T \cap C(T,E), W)$ is
called the Wiener space.

Let $\mathcal{B}'$ be the $\sigma$-algebra $\mathcal{E}^T \cap C(T,E)$, which is the product $\sigma$-algebra
restricted to $C(T,E)$. The space $C(T,E)$ carries another $\sigma$-algebra, namely
the Borel $\sigma$-algebra $\mathcal{B}$ generated by its own topology. We have $\mathcal{B} \subset \mathcal{B}'$,
since all closed balls $\{ f \in C(T,E) \ | \ |f - f_0| \le r \} \in \mathcal{B}$ are in $\mathcal{B}'$. The other
relation $\mathcal{B}' \subset \mathcal{B}$ is clear so that $\mathcal{B} = \mathcal{B}'$. The Wiener measure is therefore a
Borel measure.

Remark. The Wiener measure can also be constructed without Brownian
motion and can be used to define Brownian motion. We sketch the idea.
Let $S = \dot{\mathbb{R}}^n$ denote the one point compactification of $\mathbb{R}^n$. Let $\Omega = S^{[0,t]}$
be the set of functions from $[0,t]$ to $S$, which is also the set of paths in $\dot{\mathbb{R}}^n$.
It is by Tychonov a compact space with the product topology. Define

$$ C_{fin}(\Omega) = \{ \phi \in C(\Omega, \mathbb{R}) \ | \ \exists F : (\mathbb{R}^n)^m \to \mathbb{R}, \ \phi(\omega) = F(\omega(t_1), \dots, \omega(t_m)) \} . $$

Define also the Gauss kernel $p(x,y,t) = (4\pi t)^{-n/2} \exp(-|x-y|^2/4t)$. Define
on $C_{fin}(\Omega)$ the functional

$$ L(\phi) = \int_{(\mathbb{R}^n)^m} F(x_1, \dots, x_m) \, p(0,x_1,s_1) \, p(x_1,x_2,s_2) \cdots p(x_{m-1},x_m,s_m) \, dx_1 \cdots dx_m $$

with $s_1 = t_1$ and $s_k = t_k - t_{k-1}$ for $k \ge 2$. Since $|L(\phi)| \le \|\phi\|_\infty$, it
is a bounded linear functional on the dense linear subspace $C_{fin}(\Omega) \subset C(\Omega)$.
It is nonnegative and $L(1) = 1$. By the Hahn Banach theorem, it
extends uniquely to a bounded linear functional on $C(\Omega)$. By the Riesz
representation theorem, there exists a unique measure $\mu$ on $\Omega$ such that
$L(\phi) = \int \phi(\omega) \, d\mu(\omega)$. This is the Wiener measure on $\Omega$.

4.4 Levy's modulus of continuity 

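We start with an elementary estimate:



Lemma 4.4.1. For $a > 0$,

$$ \frac{a}{1+a^2} e^{-a^2/2} < \int_a^\infty e^{-x^2/2} dx < \frac{1}{a} e^{-a^2/2} . $$



Proof. For the right inequality consider

$$ \int_a^\infty e^{-x^2/2} dx < \int_a^\infty \frac{x}{a} e^{-x^2/2} dx = \frac{1}{a} e^{-a^2/2} . $$

For the left inequality consider

$$ \int_a^\infty \frac{1}{b^2} e^{-b^2/2} db < \frac{1}{a^2} \int_a^\infty e^{-b^2/2} db . $$

Integrating by parts of the left hand side of this gives

$$ \frac{1}{a} e^{-a^2/2} - \int_a^\infty e^{-b^2/2} db < \frac{1}{a^2} \int_a^\infty e^{-b^2/2} db . $$

□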
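
The elementary estimate used in this section is, as far as the later proof indicates, the classical two-sided Gaussian tail bound $\frac{a}{1+a^2} e^{-a^2/2} < \int_a^\infty e^{-x^2/2} dx < \frac{1}{a} e^{-a^2/2}$ for $a > 0$. Taking this form as an assumption, the sketch below checks it numerically with the complementary error function from the Python standard library; the function names are illustrative.

```python
import math

def gaussian_tail(a):
    """Integral of exp(-x^2/2) from a to infinity, via the complementary error function."""
    return math.sqrt(math.pi / 2.0) * math.erfc(a / math.sqrt(2.0))

def lower_bound(a):
    """Lower estimate a/(1+a^2) * exp(-a^2/2)."""
    return a / (1.0 + a * a) * math.exp(-a * a / 2.0)

def upper_bound(a):
    """Upper estimate exp(-a^2/2) / a."""
    return math.exp(-a * a / 2.0) / a

for a in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(a, lower_bound(a), gaussian_tail(a), upper_bound(a))
```

For large $a$ the two bounds differ only by the factor $a^2/(1+a^2)$, so both describe the tail up to a relative error of order $1/a^2$.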
Theorem 4.4.2 (Levy's modulus of continuity). If $B$ is standard Brownian
motion, then

$$ P[ \limsup_{\epsilon \to 0} \sup_{|s-t| \le \epsilon} \frac{|B_s - B_t|}{h(\epsilon)} = 1 ] = 1 , $$

where $h(\epsilon) = \sqrt{2 \epsilon \log(1/\epsilon)}$.



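Proof. We follow [86]:

(i) Proof of the inequality "$\ge 1$".

Take $0 < \delta < 1$. Define $a_n = (1-\delta) h(2^{-n}) 2^{n/2} = (1-\delta) \sqrt{n \cdot 2 \log 2}$. Consider

$$ P[A_n] = P[ \max_{1 \le k \le 2^n} |B_{k 2^{-n}} - B_{(k-1) 2^{-n}}| \le a_n 2^{-n/2} ] . $$

Because $B_{k/2^n} - B_{(k-1)/2^n}$ are independent Gaussian random variables, we
compute, using the above lemma (4.4.1) and $1 - s < e^{-s}$

$$ P[A_n] = ( 1 - 2 \int_{a_n}^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx )^{2^n} $$
$$ \le ( 1 - \frac{2}{\sqrt{2\pi}} \frac{a_n}{a_n^2 + 1} e^{-a_n^2/2} )^{2^n} $$
$$ \le \exp( - 2^n \frac{2}{\sqrt{2\pi}} \frac{a_n}{a_n^2 + 1} e^{-a_n^2/2} ) \le e^{- C \cdot 2^{n(1-(1-\delta)^2)} / \sqrt{n}} $$

where $C$ is a constant independent of $n$. Since $\sum_n P[A_n] < \infty$, we get by
the first Borel-Cantelli lemma that $P[\limsup_n A_n] = 0$. Because $\delta$ was
arbitrary, this gives

$$ P[ \liminf_{n \to \infty} \max_{1 \le k \le 2^n} |B_{k 2^{-n}} - B_{(k-1) 2^{-n}}| / h(2^{-n}) \ge 1 ] = 1 . $$

(ii) Proof of the inequality "$\le 1$".

Take again $0 < \delta < 1$ and pick $\epsilon > 0$ such that $(1+\epsilon)(1-\delta) > (1+\delta)$.
Define

$$ P[A_n] = P[ \max_{k = j-i \in K} |B_{j 2^{-n}} - B_{i 2^{-n}}| / h(k 2^{-n}) \ge (1+\epsilon) ] $$
$$ = P[ \bigcup_{k = j-i \in K} \{ |B_{j 2^{-n}} - B_{i 2^{-n}}| \ge a_{n,k} \} ] , $$

where

$$ K = \{ 0 < k \le 2^{n\delta} \} $$

and $a_{n,k} = h(k 2^{-n})(1+\epsilon)$.

Using the above lemma, we get with some constants $C$ which may vary
from line to line:

$$ P[A_n] \le 2^n \sum_{k \in K} 2 \int_{(1+\epsilon)\sqrt{2 \log(k^{-1} 2^n)}}^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx $$
$$ \le C \cdot 2^n \sum_{k \in K} \log(k^{-1} 2^n)^{-1/2} e^{-(1+\epsilon)^2 \log(k^{-1} 2^n)} $$
$$ \le C \cdot 2^n 2^{-n(1-\delta)(1+\epsilon)^2} \sum_{k \in K} (\log(k^{-1} 2^n))^{-1/2} \quad ( \text{since } k^{-1} \ge 2^{-n\delta} ) $$
$$ \le C \cdot n^{-1/2} 2^{n((1+\delta) - (1-\delta)(1+\epsilon)^2)} . $$

The factor $2^n$ bounds the number of pairs $(i,j)$ with $j - i = k$. In the last
step was used that there are at most $2^{n\delta}$ points in $K$ and for
each of them $\log(k^{-1} 2^n) \ge \log(2^{n(1-\delta)})$.

By the choice of $\epsilon$, the exponent is negative and we see that $\sum_n P[A_n]$
converges. By Borel-Cantelli we get for almost every
$\omega$ an integer $n(\omega)$ such that for $n > n(\omega)$

$$ |B_{j 2^{-n}} - B_{i 2^{-n}}| < (1+\epsilon) \cdot h(k 2^{-n}) , $$

where $k = j - i \in K$. Increase possibly $n(\omega)$ so that for $n > n(\omega)$

$$ \sum_{m > n} h(2^{-m}) < \epsilon \cdot h(2^{-(n+1)(1-\delta)}) . $$

Pick $0 \le t_1 < t_2 \le 1$ such that $t = t_2 - t_1 < 2^{-n(\omega)(1-\delta)}$. Take next
$n > n(\omega)$ such that $2^{-(n+1)(1-\delta)} \le t < 2^{-n(1-\delta)}$ and write the dyadic
development of $t_1, t_2$:

$$ t_1 = i 2^{-n} - 2^{-p_1} - 2^{-p_2} \dots , \quad t_2 = j 2^{-n} + 2^{-q_1} + 2^{-q_2} \dots $$

with $t_1 \le i 2^{-n} < j 2^{-n} \le t_2$ and $0 < k = j - i \le t 2^n < 2^{n\delta}$. We get

$$ |B_{t_1}(\omega) - B_{t_2}(\omega)| \le |B_{t_1}(\omega) - B_{i 2^{-n}}(\omega)| + |B_{i 2^{-n}}(\omega) - B_{j 2^{-n}}(\omega)| + |B_{j 2^{-n}}(\omega) - B_{t_2}(\omega)| $$
$$ \le 2 \sum_{p > n} (1+\epsilon) h(2^{-p}) + (1+\epsilon) h(k 2^{-n}) $$
$$ \le (1 + 3\epsilon + 2\epsilon^2) h(t) . $$

Because $\epsilon > 0$ was arbitrary, the proof is complete. □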
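
Part (i) of the proof can be illustrated numerically: on the dyadic grid of mesh $2^{-n}$ the largest increment of a Brownian path should be comparable to $h(2^{-n})$. The following is a rough sketch with illustrative names and an arbitrary seed. It only looks at neighboring grid points, so it reflects the lower bound, and the convergence of the ratio towards 1 is slow.

```python
import math
import random

def h(eps):
    """Levy's modulus h(eps) = sqrt(2 * eps * log(1/eps))."""
    return math.sqrt(2.0 * eps * math.log(1.0 / eps))

def max_dyadic_increment_ratio(n, rng):
    """max_k |B_{k 2^-n} - B_{(k-1) 2^-n}| / h(2^-n) for one simulated path on [0, 1]."""
    dt = 2.0 ** (-n)
    largest = max(abs(rng.gauss(0.0, math.sqrt(dt))) for _ in range(2 ** n))
    return largest / h(dt)

rng = random.Random(2)
ratios = {n: max_dyadic_increment_ratio(n, rng) for n in (8, 12, 16)}
for n, r in ratios.items():
    print("n =", n, " ratio =", round(r, 3))
```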



4.5 Stopping times 

Stopping times are useful for the construction of new processes, in proofs 
of inequalities and convergence theorems as well as in the study of return 
time results. A good source for stopping time results and stochastic processes
in general is [86].

Definition. A filtration of a measurable space $(\Omega, \mathcal{A})$ is an increasing family
$(\mathcal{A}_t)_{t \ge 0}$ of sub-$\sigma$-algebras of $\mathcal{A}$. A measurable space endowed with a
filtration $(\mathcal{A}_t)_{t \ge 0}$ is called a filtered space. A process $X$ is called adapted to the
filtration $\mathcal{A}_t$, if $X_t$ is $\mathcal{A}_t$-measurable for all $t$.
Definition. A process $X$ on $(\Omega, \mathcal{A}, P)$ defines a natural filtration
$\mathcal{A}_t = \sigma(X_s \ | \ s \le t)$, the minimal filtration for which $X$ is adapted.
Heuristically, $\mathcal{A}_t$ is the set of events which may occur up to time $t$.

Definition. With a filtration we can associate two other filtrations by setting
for $t > 0$

$$ \mathcal{A}_{t-} = \sigma(\mathcal{A}_s, s < t) , \quad \mathcal{A}_{t+} = \bigcap_{s > t} \mathcal{A}_s . $$

For $t = 0$ we can still define $\mathcal{A}_{0+} = \bigcap_{s > 0} \mathcal{A}_s$ and define $\mathcal{A}_{0-} = \mathcal{A}_0$. Define
also $\mathcal{A}_\infty = \sigma(\mathcal{A}_s, s \ge 0)$.

Remark. We always have $\mathcal{A}_{t-} \subset \mathcal{A}_t \subset \mathcal{A}_{t+}$ and both inclusions can be
strict.

Definition. If $\mathcal{A}_t = \mathcal{A}_{t+}$ then the filtration $\mathcal{A}_t$ is called right continuous. If
$\mathcal{A}_t = \mathcal{A}_{t-}$, then $\mathcal{A}_t$ is left continuous. As an example, the filtration $\mathcal{A}_{t+}$ of
any filtration is right continuous.

Definition. A stopping time relative to a filtration $\mathcal{A}_t$ is a map
$T : \Omega \to [0,\infty]$ such that $\{T \le t\} \in \mathcal{A}_t$ for all $t$.

Remark. If $\mathcal{A}_t$ is right continuous, then $T$ is a stopping time if and only
if $\{T < t\} \in \mathcal{A}_t$. Also $T$ is a stopping time if and only if $X_t = 1_{(0,T]}(t)$ is
adapted. $X$ is then a left continuous adapted process.

Definition. If $T$ is a stopping time, define

$$ \mathcal{A}_T = \{ A \in \mathcal{A}_\infty \ | \ A \cap \{T \le t\} \in \mathcal{A}_t, \forall t \} . $$

It is a $\sigma$-algebra. As an example, if $T = s$ is constant, then $\mathcal{A}_T = \mathcal{A}_s$. Note
also that

$$ \mathcal{A}_{T+} = \{ A \in \mathcal{A}_\infty \ | \ A \cap \{T < t\} \in \mathcal{A}_t, \forall t \} . $$

We give examples of stopping times.



Proposition 4.5.1. Let $X$ be the coordinate process on $C(\mathbb{R}_+, E)$, where $E$
is a metric space. Let $A$ be a closed set in $E$. Then the so called entry time

$$ T_A(\omega) = \inf\{ t \ge 0 \ | \ X_t(\omega) \in A \} $$

is a stopping time relative to the filtration $\mathcal{A}_t = \sigma(\{X_s\}_{s \le t})$.



Proof. Let $d$ be the metric on $E$. We have

$$ \{T_A \le t\} = \{ \inf_{s \in \mathbb{Q}, s \le t} d(X_s(\omega), A) = 0 \} $$

which is in $\mathcal{A}_t = \sigma(X_s, s \le t)$. □
Proposition 4.5.2. Let $X$ be the coordinate process on $D(\mathbb{R}_+, E)$, the space
of right continuous functions, where $E$ is a metric space. Let $A$ be an open
subset of $E$. Then the hitting time

$$ T_A(\omega) = \inf\{ t > 0 \ | \ X_t(\omega) \in A \} $$

is a stopping time with respect to the filtration $\mathcal{A}_{t+}$.



Proof. $T_A$ is a $\mathcal{A}_{t+}$ stopping time if and only if $\{T_A < t\} \in \mathcal{A}_t$ for all $t$.
If $A$ is open and $X_s(\omega) \in A$, we know by the right-continuity of the paths
that $X_t(\omega) \in A$ for every $t \in [s, s+\epsilon)$ for some $\epsilon > 0$. Therefore

$$ \{T_A < t\} = \bigcup_{s \in \mathbb{Q}, s < t} \{ X_s \in A \} \in \mathcal{A}_t . $$

□

Definition. Let $\mathcal{A}_t$ be a filtration on $(\Omega, \mathcal{A})$ and let $T$ be a stopping time.
For a process $X$, we define a new random variable $X_T$ on the set $\{T < \infty\}$
by

$$ X_T(\omega) = X_{T(\omega)}(\omega) . $$

Remark. We have met this definition already in the case of discrete time
but in the present situation, it is not clear whether $X_T$ is measurable. It
turns out that this is true for many processes.

Definition. A process $X$ is called progressively measurable with respect to a
filtration $\mathcal{A}_t$ if for all $t$, the map $(s,\omega) \mapsto X_s(\omega)$ from
$([0,t] \times \Omega, \mathcal{B}([0,t]) \times \mathcal{A}_t)$ into $(E, \mathcal{E})$ is measurable.

A progressively measurable process is adapted. For some processes, the
converse holds:



Lemma 4.5.3. An adapted process with right or left continuous paths is 
progressively measurable. 



Proof. Assume right continuity (the argument is similar in the case of left
continuity). Write $X$ as the coordinate process on $D([0,t], E)$. Denote the map
$(s,\omega) \mapsto X_s(\omega)$ with $Y = Y(s,\omega)$. Let $U \in \mathcal{E}$ be a closed ball. We have to
show that $Y^{-1}(U) = \{ (s,\omega) \ | \ Y(s,\omega) \in U \} \in \mathcal{B}([0,t]) \times \mathcal{A}_t$. Given $k \in \mathbb{N}$,
we define $E_{0,U} = 0$ and inductively for $k \ge 1$ the $k$'th hitting time (a
stopping time)

$$ H_{k,U}(\omega) = \inf\{ s \in \mathbb{Q} \ | \ E_{k-1,U}(\omega) < s < t, \ X_s \in U \} $$

as well as the $k$'th exit time (not necessarily a stopping time)

$$ E_{k,U}(\omega) = \inf\{ s \in \mathbb{Q} \ | \ H_{k,U}(\omega) < s < t, \ X_s \notin U \} . $$

These are countably many measurable maps from $D([0,t], E)$ to $[0,t]$. Then
by the right-continuity

$$ Y^{-1}(U) = \bigcup_{k=1}^\infty \{ (s,\omega) \ | \ H_{k,U}(\omega) \le s \le E_{k,U}(\omega) \} $$

which is in $\mathcal{B}([0,t]) \times \mathcal{A}_t$. □



Proposition 4.5.4. If $X$ is progressively measurable and $T$ is a stopping
time, then $X_T$ is $\mathcal{A}_T$-measurable on the set $\{T < \infty\}$.



Proof. The set $\{T < \infty\}$ is itself in $\mathcal{A}_T$. To say that $X_T$ is $\mathcal{A}_T$-measurable
on this set is equivalent with $X_T \cdot 1_{\{T \le t\}} \in \mathcal{A}_t$ for every $t$. But the map

$$ T : (\{T \le t\}, \mathcal{A}_t \cap \{T \le t\}) \to ([0,t], \mathcal{B}[0,t]) $$

is measurable because $T$ is a stopping time. This means that the map
$\omega \mapsto (T(\omega), \omega)$ from $(\Omega, \mathcal{A}_t)$ to $([0,t] \times \Omega, \mathcal{B}([0,t]) \times \mathcal{A}_t)$ is measurable and
$X_T$ is the composition of this map with $X$ which is $\mathcal{B}[0,t] \times \mathcal{A}_t$ measurable
by hypothesis. □

Definition. Given a stopping time $T$ and a process $X$, we define the stopped
process $(X^T)_t(\omega) = X_{T \wedge t}(\omega)$.

Remark. If $\mathcal{A}_t$ is a filtration then $\mathcal{A}_{t \wedge T}$ is a filtration since if $T_1$ and $T_2$ are
stopping times, then $T_1 \wedge T_2$ is a stopping time.



Corollary 4.5.5. If $X$ is progressively measurable with respect to $\mathcal{A}_t$ and
$T$ is a stopping time, then $(X^T)_t = X_{t \wedge T}$ is progressively measurable with
respect to $\mathcal{A}_{t \wedge T}$.



Proof. Because $t \wedge T$ is a stopping time, we have from the previous
proposition that $X^T$ is $\mathcal{A}_{t \wedge T}$ measurable.

We know by assumption that $\phi : (s,\omega) \mapsto X_s(\omega)$ is measurable. Since also
$\psi : (s,\omega) \mapsto (s \wedge T)(\omega)$ is measurable, we know also that the composition
$(s,\omega) \mapsto X^T_s(\omega) = X_{\psi(s,\omega)}(\omega) = \phi(\psi(s,\omega), \omega)$ is measurable. □
Proposition 4.5.6. Every stopping time is the decreasing limit of a sequence
of stopping times taking only finitely many values.



Proof. Given a stopping time $T$, define the discretisation $T_k = +\infty$ if $T \ge k$
and $T_k = q 2^{-k}$ if $(q-1) 2^{-k} \le T < q 2^{-k}$, $q \le 2^k k$. Each $T_k$ is a stopping
time and $T_k$ decreases to $T$. □
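
The discretisation in this proof acts on each value $T(\omega)$ separately and is easy to write down. Here is a minimal sketch; the function name is illustrative, and the fixed number `T = 0.3` stands in for the value of a stopping time at one sample point.

```python
import math

def discretize(T, k):
    """Dyadic approximation T_k from above: +inf if T >= k, else q 2^-k
    with (q-1) 2^-k <= T < q 2^-k."""
    if T >= k:
        return math.inf
    q = math.floor(T * 2 ** k) + 1
    return q / 2 ** k

T = 0.3
approx = [discretize(T, k) for k in range(1, 11)]
print(approx)
```

Rounding up rather than down matters: the event $\{T_k \le q 2^{-k}\} = \{T < q 2^{-k}\}$ only involves information available at time $q 2^{-k}$, which is what makes each $T_k$ a stopping time.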

Many concepts of classical potential theory can be expressed in an elegant 
form in a probabilistic language. We give very briefly some examples with- 
out proofs, but some hints to the literature. 

Let $B_t$ be Brownian motion in $\mathbb{R}^d$ and $T_A$ the hitting time of a set $A \subset \mathbb{R}^d$.
Let $D$ be a domain in $\mathbb{R}^d$ with boundary $\delta D$ such that the Green function
$G(x,y)$ exists in $D$. Such a domain is then called a Green domain.

Definition. The Green function of a domain $D$ is defined as the fundamental
solution satisfying $\Delta G(x,y) = \delta(x-y)$, where $\delta(x-y)$ is the Dirac measure
at $y \in D$. Having the fundamental solution $G$, we can solve the Poisson
equation $\Delta u = v$ for a given function $v$ by

$$ u(x) = \int_D G(x,y) \cdot v(y) \, dy . $$

The Green function can be computed using Brownian motion as follows:

$$ G(x,y) = \int_0^\infty g(t,x,y) \, dt , $$

where for $x \in D$,

$$ \int_C g(t,x,y) \, dy = P_x[B_t \in C, T_{\delta D} > t] $$

and $P_x$ is the Wiener measure of $B_t$ starting at the point $x$.

We can interpret that as follows. To determine $G(x,y)$, consider the killed
Brownian motion $B_t$ starting at $x$, where $T$ is the hitting time of the
boundary. $G(x,y)$ is then the probability density of the particles described by the
Brownian motion.

Definition. The classical Dirichlet problem for a bounded Green domain
$D \subset \mathbb{R}^d$ with boundary $\delta D$ is to find for a given function $f \in C(\delta D)$, a
solution $u \in C(D)$ such that $\Delta u = 0$ inside $D$ and

$$ \lim_{x \to y, x \in D} u(x) = f(y) $$

for every $y \in \delta D$.

This problem can not be solved in general even for domains with piecewise
smooth boundaries if $d \ge 3$.

Definition. The following example, called Lebesgue thorn or Lebesgue
spine, was suggested by Lebesgue in 1913. Let $D$ be the inside of a
spherical chamber in which a thorn is punched in. The boundary $\delta D$ is
held on constant temperature $f$, where $f = 1$ at the tip of the thorn $y$
and zero except in a small neighborhood of $y$. The temperature $u$ inside
$D$ is a solution of the Dirichlet problem $\Delta u = 0$ satisfying the boundary
condition $u = f$ on the boundary $\delta D$. But the heat radiated from the thorn
is proportional to its surface area. If the tip is sharp enough, a person sitting
in the chamber will be cold, no matter how close to the heater. This means
$\liminf_{x \to y, x \in D} u(x) < 1 = f(y)$. (For more details, see [44, 47]).

Because of this problem, one has to modify the question and declares $u$ to be
a solution of a modified Dirichlet problem, if $u$ satisfies $\Delta u = 0$ inside $D$
and $\lim_{x \to y, x \in D} u(x) = f(y)$ for all nonsingular points $y$ in the boundary
$\delta D$. Irregularity of a point $y$ can be defined analytically but it is equivalent
with $P_y[T_{D^c} > 0] = 1$, which means that almost every Brownian particle
starting at $y \in \delta D$ will return to $\delta D$ only after positive time.



Theorem 4.5.7 (Kakutani 1944). The solution of the regularized Dirichlet
problem can be expressed with Brownian motion $B_t$ and the hitting time
$T$ of the boundary:

$$ u(x) = E_x[f(B_T)] . $$



In words, the solution $u(x)$ of the Dirichlet problem is the expected value
of the boundary function $f$ at the exit point $B_T$ of Brownian motion $B_t$
starting at $x$. We have seen in the previous chapter that the discretized
version of this result on a graph is quite easy to prove.
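
The discretized version on a graph can be tried out with a few lines of code. The following Monte Carlo sketch replaces Brownian motion by a simple random walk on the grid $\{0, \dots, n\}^2$; the function names, the grid size and the boundary function $f(x,y) = x/n$ are illustrative assumptions. This $f$ is chosen because $u(x,y) = x/n$ is harmonic on the grid, so the exact discrete solution is known.

```python
import random

def exit_value(start, n, f, rng):
    """Run a simple random walk on the grid {0..n}^2 from start until it reaches
    the boundary and return f at the exit point."""
    x, y = start
    while 0 < x < n and 0 < y < n:
        dx, dy = rng.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        x, y = x + dx, y + dy
    return f(x, y)

def dirichlet_estimate(start, n, f, n_walks, rng):
    """Monte Carlo estimate of u(start) = E_x[f(exit point)]."""
    return sum(exit_value(start, n, f, rng) for _ in range(n_walks)) / n_walks

n = 10
f = lambda x, y: x / n      # boundary data; u(x, y) = x/n is harmonic on the grid
rng = random.Random(3)
u_est = dirichlet_estimate((3, 5), n, f, 4000, rng)
print("estimate:", u_est, " exact discrete solution:", 3 / n)
```

The statistical error decreases like the inverse square root of the number of walks, so this approach is mainly useful when the value of $u$ is needed only at a few points.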



Figure. To solve the Dirichlet
problem in a bounded domain
with Brownian motion, start the
process at the point $x$ and run it
until it reaches the boundary at $B_T$,
then compute $f(B_T)$ and average
this random variable over all
paths $\omega$.
Remark. Ikeda has discovered that there exists also a probabilistic method
for solving the classical Neumann problem in the case $d = 2$. For more
information about this, one can consult [44, 81]. The process for the
Neumann problem is not the process of killed Brownian motion, but the
process of reflected Brownian motion.

Remark. Given the Dirichlet Laplacian $\Delta$ of a bounded domain $D$, one
can compute the heat flow $e^{-t\Delta} u$ by the following formula

$$ (e^{-t\Delta} u)(x) = E_x[u(B_t); t < T] , $$

where $T$ is the hitting time of $\delta D$ for Brownian motion $B_t$ starting at $x$.

Remark. Let $K$ be a compact subset of a Green domain $D$. The hitting
probability

$$ p(x) = P_x[T_K < T_{\delta D}] $$

is the equilibrium potential of $K$ relative to $D$. We give a definition of the
equilibrium potential later. Physically, the equilibrium potential is obtained
by measuring the electrostatic potential, if one is grounding the conducting
boundary and charging the conducting set $K$ with a unit amount of charge.

4.6 Continuous time martingales 

Definition. Given a filtration $\mathcal{A}_t$ of the probability space $(\Omega, \mathcal{A}, P)$. A real-
valued process $X_t \in \mathcal{L}^1$ which is $\mathcal{A}_t$ adapted is called a submartingale, if
$E[X_t | \mathcal{A}_s] \ge X_s$ for $s \le t$, it is called a supermartingale if $-X$ is a submartingale
and a martingale, if it is both a super and sub-martingale. If additionally
$X_t \in \mathcal{L}^p$ for all $t$, we speak of $\mathcal{L}^p$ super or sub-martingales.

We have seen martingales for discrete time already in the last chapter. 
Brownian motion gives examples with continuous time. 



Proposition 4.6.1. Let $B_t$ be standard Brownian motion. Then $B_t$, $B_t^2 - t$
and $e^{\alpha B_t - \alpha^2 t/2}$ are martingales.



Proof. $B_t - B_s$ is independent of $\mathcal{A}_s$. Therefore

$$ E[B_t | \mathcal{A}_s] - B_s = E[B_t - B_s | \mathcal{A}_s] = E[B_t - B_s] = 0 . $$

Since by the "extracting knowledge" property

$$ E[B_t B_s | \mathcal{A}_s] = B_s \cdot E[B_t | \mathcal{A}_s] = B_s^2 , $$

we get

$$ E[B_t^2 - t | \mathcal{A}_s] - (B_s^2 - s) = E[B_t^2 - B_s^2 | \mathcal{A}_s] - (t-s) = E[(B_t - B_s)^2 | \mathcal{A}_s] - (t-s) = 0 . $$

Since Brownian motion begins at any time $s$ new, we have

$$ E[e^{\alpha (B_t - B_s)} | \mathcal{A}_s] = E[e^{\alpha B_{t-s}}] = e^{\alpha^2 (t-s)/2} $$

from which

$$ E[e^{\alpha B_t} | \mathcal{A}_s] e^{-\alpha^2 t/2} = e^{\alpha B_s} e^{-\alpha^2 s/2} $$

follows. □
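
A martingale has constant expectation, so the three processes must satisfy $E[B_t] = 0$, $E[B_t^2 - t] = 0$ and $E[e^{\alpha B_t - \alpha^2 t/2}] = 1$ for every $t$. The sketch below checks this necessary condition by sampling $B_t$ directly from its $N(0,t)$ distribution; the parameter values and the seed are arbitrary choices, and the check does not test the full conditional expectation property.

```python
import math
import random

rng = random.Random(4)
alpha, t, n = 1.0, 1.0, 40000
samples = [rng.gauss(0.0, math.sqrt(t)) for _ in range(n)]   # draws of B_t

mean_b = sum(samples) / n
mean_sq = sum(b * b - t for b in samples) / n
mean_exp = sum(math.exp(alpha * b - alpha * alpha * t / 2.0) for b in samples) / n
print(mean_b, mean_sq, mean_exp)   # theoretical values: 0, 0, 1
```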

As in the discrete case, we remark: 

Proposition 4.6.2. If $X_t$ is a $\mathcal{L}^p$-martingale, then $|X_t|^p$ is a submartingale
for $p \ge 1$.



Proof. The conditional Jensen inequality gives

$$ E[|X_t|^p | \mathcal{A}_s] \ge |E[X_t | \mathcal{A}_s]|^p = |X_s|^p . $$

□

Example. Let $X_n$ be a sequence of IID exponential distributed random
variables with probability density $f_X(x) = c e^{-cx}$. Let $S_n = \sum_{k=1}^n X_k$. The
Poisson process $N_t$ with time $T = \mathbb{R}_+ = [0,\infty)$ is defined as

$$ N_t = \sum_{k=1}^\infty 1_{S_k \le t} . $$

It leads to an example of a martingale which is not continuous. This process
takes values in $\mathbb{N}$ and counts how many jumps have happened up to time
$t$. Since $E[N_t] = ct$ and the increments are independent, it follows that
$N_t - ct$ is a martingale with respect to
the filtration $\mathcal{A}_t = \sigma(N_s, s \le t)$. It is a right continuous process. We know
therefore that it is progressively measurable and that for each stopping
time $T$, also the stopped process $N^T$ is progressively measurable. See [50] or
the last chapter for more information about Poisson processes.
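
The definition of $N_t$ through exponential waiting times can be simulated directly. The following sketch uses `random.expovariate` from the Python standard library, whose argument is the rate $c$; the function name and the parameter values are illustrative.

```python
import random

def poisson_count(t, c, rng):
    """N_t: number of partial sums S_k = X_1 + ... + X_k with S_k <= t,
    where the X_i are IID exponential with density c * exp(-c x)."""
    count, s = 0, rng.expovariate(c)
    while s <= t:
        count += 1
        s += rng.expovariate(c)
    return count

rng = random.Random(5)
c, t, n_runs = 2.0, 3.0, 5000
counts = [poisson_count(t, c, rng) for _ in range(n_runs)]
mean = sum(counts) / n_runs
var = sum((k - mean) ** 2 for k in counts) / n_runs
print("mean:", mean, " variance:", var, " theory: both equal", c * t)
```

That the sample mean and the sample variance are both close to $ct$ is consistent with $N_t$ being Poisson distributed with parameter $ct$.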



Figure. The Poisson point process on the line. $N_t$ is the number of
events which happen up to time $t$. It could model for example the
number $N_t$ of hits onto a website.
Proposition 4.6.3. (Interval theorem) The Poisson process has independent
increments

$$ N_t - N_s = \sum_{n=1}^\infty 1_{s < S_n \le t} . $$

Moreover, $N_t$ is Poisson distributed with parameter $tc$:

$$ P[N_t = k] = \frac{(tc)^k}{k!} e^{-tc} . $$



Proof. The proof is done by starting with a Poisson distributed process $N_t$.
Define then

$$ S_n(\omega) = \{ t \ | \ N_t = n, N_{t-} = n-1 \} $$

and show that $X_n = S_n - S_{n-1}$ are independent random variables with
exponential distribution. □

Remark. Poisson processes on the lattice $\mathbb{Z}^d$ are also called Brownian
motion on the lattice and can be used to describe Feynman-Kac formulas for
discrete Schrödinger operators. The process is defined as follows: take the
jump times $S_k$ as above and define

$$ Y_t = \sum_{k=1}^\infty Z_k 1_{S_k \le t} , $$

where $Z_n$ are IID random variables taking values in $\{ m \in \mathbb{Z}^d \ | \ |m| = 1 \}$.
This means that a particle stays at a lattice site for an exponential time
and jumps then to one of the neighbors of $n$ with equal probability. Let
$P_n$ be the analog of the Wiener measure on right continuous paths on the
lattice and denote with $E_n$ the expectation. The Feynman-Kac formula for
discrete Schrödinger operators $H = H_0 + V$ is

$$ (e^{-itH} u)(n) = e^{2dt} E_n[ u(Y_t) \, i^{N_t} e^{-i \int_0^t V(Y_s) \, ds} ] . $$



4.7 Doob inequalities 

We have already established inequalities of Doob for discrete times T = N. 
By a limiting argument, they hold also for right-continuous submartingales. 



Theorem 4.7.1 (Doob's submartingale inequality). Let $X$ be a non-negative
right continuous submartingale with time $T = [a,b]$. For any $\epsilon > 0$

$$ \epsilon \cdot P[ \sup_{a \le t \le b} X_t \ge \epsilon ] \le E[X_b ; \{ \sup_{a \le t \le b} X_t \ge \epsilon \}] \le E[X_b] . $$
Proof. Take a countable dense subset $D$ of $T$ containing $b$ and choose an increasing
sequence $D_n$ of finite sets such that $\bigcup_n D_n = D$. We know now that for all $n$

$$ \epsilon \cdot P[ \sup_{t \in D_n} X_t \ge \epsilon ] \le E[X_b ; \{ \sup_{t \in D_n} X_t \ge \epsilon \}] \le E[X_b] $$

since $E[X_t]$ is nondecreasing in $t$. Going to the limit $n \to \infty$ gives the claim
with $T = D$. Since $X$ is right continuous, we get the claim for $T = [a,b]$. □

One often applies this inequality to the non-negative submartingale $|X|$ if
$X$ is a martingale.



Theorem 4.7.2 (Doob's $\mathcal{L}^p$ inequality). Fix $p > 1$ and $q$ satisfying
$p^{-1} + q^{-1} = 1$. Given a non-negative right-continuous submartingale $X$ with
time $T = [a, b]$ which is bounded in $\mathcal{L}^p$. Then $X^* = \sup_{t \in T} X_t$ is in
$\mathcal{L}^p$ and satisfies

$$ \| X^* \|_p \le q \cdot \sup_{t \in T} \| X_t \|_p . $$



Proof. Take a countable dense subset $D$ of $T$ and choose an increasing sequence
$D_n$ of finite sets such that $\bigcup_n D_n = D$.
We had

$$ \| \sup_{t \in D_n} X_t \|_p \le q \cdot \sup_{t \in D_n} \| X_t \|_p . $$

Going to the limit gives

$$ \| \sup_{t \in D} X_t \|_p \le q \cdot \sup_{t \in D} \| X_t \|_p . $$

Since $D$ is dense and $X$ is right continuous we can replace $D$ by $T$. □

The following inequality measures how big the probability is that
one-dimensional Brownian motion will leave the cone $\{ (t, x) \mid |x| \le a \cdot t \}$.



Theorem 4.7.3 (Exponential inequality). $S_t = \sup_{0 \le s \le t} B_s$ satisfies for any
$a > 0$

$$ P[ S_t \ge a \cdot t ] \le e^{-a^2 t/2} . $$



Proof. We have seen in proposition (4.6.1) that $M_t = e^{\alpha B_t - \alpha^2 t/2}$ is a
martingale. It is nonnegative. Since

$$ \exp(\alpha S_t - \frac{\alpha^2 t}{2}) \le \exp( \sup_{s \le t} \alpha B_s - \frac{\alpha^2 t}{2} ) \le \sup_{s \le t} \exp( \alpha B_s - \frac{\alpha^2 s}{2} ) = \sup_{s \le t} M_s , $$

we get with Doob's submartingale inequality (4.7.1)

$$ P[ S_t \ge a t ] \le P[ \sup_{s \le t} M_s \ge e^{\alpha a t - \alpha^2 t/2} ] \le \exp( -\alpha a t + \frac{\alpha^2 t}{2} ) E[M_t] . $$

The result follows from $E[M_t] = E[M_0] = 1$ and
$\inf_{\alpha > 0} \exp( -\alpha a t + \frac{\alpha^2 t}{2} ) = \exp( -\frac{a^2 t}{2} )$. □
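
The exponential inequality can be compared with the exact tail of the running maximum. The sketch below assumes the reflection principle $P[S_t \ge x] = 2 P[B_t \ge x]$, a standard fact about one-dimensional Brownian motion that is not derived in this section; the function names are illustrative.

```python
import math


def running_max_tail(a: float, t: float) -> float:
    """Exact P[S_t >= a*t], assuming the reflection principle:
    2 * P[B_t >= a*t] = erfc(a * sqrt(t / 2))."""
    return math.erfc(a * math.sqrt(t / 2.0))


def exponential_bound(a: float, t: float) -> float:
    """Right hand side exp(-a^2 t / 2) of the exponential inequality."""
    return math.exp(-a * a * t / 2.0)


for a, t in [(0.5, 1.0), (1.0, 1.0), (2.0, 0.5), (1.0, 4.0)]:
    print(a, t, running_max_tail(a, t), exponential_bound(a, t))
```

In every row the exact tail stays below the bound, which reflects the elementary estimate $\mathrm{erfc}(z) \le e^{-z^2}$ for $z \ge 0$.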

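Another corollary of Doob's maximal inequality will also be useful.

Corollary 4.7.4. For $\alpha, \beta > 0$,

$$ P[ \sup_{s \in [0,1]} ( B_s - \frac{\alpha s}{2} ) \ge \beta ] \le e^{-\alpha \beta} . $$



Proof. With the martingale $M_s = e^{\alpha B_s - \alpha^2 s/2}$ we have

$$ P[ \sup_{s \in [0,1]} ( B_s - \frac{\alpha s}{2} ) \ge \beta ] = P[ \sup_{s \in [0,1]} e^{\alpha B_s - \alpha^2 s/2} \ge e^{\beta \alpha} ] $$

$$ = P[ \sup_{s \in [0,1]} M_s \ge e^{\beta \alpha} ] $$

$$ \le e^{-\beta \alpha} \sup_{s \in [0,1]} E[M_s] = e^{-\beta \alpha} $$

since $E[M_s] = 1$ for all $s$. □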



4.8 Khintchine's law of the iterated logarithm 

Khintchine's law of the iterated logarithm for Brownian motion gives a
precise statement about how one-dimensional Brownian motion oscillates in a
neighborhood of the origin. As in the law of the iterated logarithm, define

$$ \Lambda(t) = \sqrt{ 2 t \log | \log t | } . $$



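Theorem 4.8.1 (Law of iterated logarithm for Brownian motion).

$$ P[ \limsup_{t \to 0} \frac{B_t}{\Lambda(t)} = 1 ] = 1, \qquad P[ \liminf_{t \to 0} \frac{B_t}{\Lambda(t)} = -1 ] = 1 . $$

Proof. The second statement follows from the first by changing $B_t$ to $-B_t$.

(i) $\limsup_{s \to 0} B_s / \Lambda(s) \le 1$ almost everywhere:
Take $\theta, \delta \in (0, 1)$ and define

$$ \alpha_n = (1 + \delta) \theta^{-n} \Lambda(\theta^n), \qquad \beta_n = \frac{\Lambda(\theta^n)}{2} . $$

We have $\alpha_n \beta_n = (1 + \delta) \log | \log(\theta^n) | = (1 + \delta) \log( n | \log \theta | )$.
From corollary (4.7.4), we get

$$ P[ \sup_{s \le 1} ( B_s - \frac{\alpha_n s}{2} ) \ge \beta_n ] \le e^{-\alpha_n \beta_n} = K n^{-(1 + \delta)} $$

with the constant $K = | \log \theta |^{-(1+\delta)}$. The Borel-Cantelli lemma assures

$$ P[ \liminf_{n \to \infty} \{ \sup_{s \le 1} ( B_s - \frac{\alpha_n s}{2} ) < \beta_n \} ] = 1 $$

which means that for almost every $\omega$, there is $n_0(\omega)$ such that for $n > n_0(\omega)$
and $s \in [0, \theta^{n-1})$,

$$ B_s(\omega) \le \frac{\alpha_n s}{2} + \beta_n \le \frac{\alpha_n \theta^{n-1}}{2} + \beta_n = ( \frac{1 + \delta}{2 \theta} + \frac{1}{2} ) \Lambda(\theta^n) . $$

Since $\Lambda$ is increasing on a sufficiently small interval $[0, a)$, we have for
sufficiently large $n$ and $s \in (\theta^n, \theta^{n-1}]$

$$ B_s(\omega) \le ( \frac{1 + \delta}{2 \theta} + \frac{1}{2} ) \Lambda(s) . $$

In the limit $\theta \to 1$ and $\delta \to 0$, we get the claim.

(ii) $\limsup_{s \to 0} B_s / \Lambda(s) \ge 1$ almost everywhere.
For $\theta \in (0, 1)$, the sets

$$ A_n = \{ B_{\theta^n} - B_{\theta^{n+1}} \ge (1 - \sqrt{\theta}) \Lambda(\theta^n) \} $$

are independent and since $B_{\theta^n} - B_{\theta^{n+1}}$ is Gaussian with variance
$\theta^n - \theta^{n+1}$, we have

$$ P[A_n] = \int_a^{\infty} e^{-u^2/2} \frac{du}{\sqrt{2 \pi}} > \frac{a}{a^2 + 1} \frac{e^{-a^2/2}}{\sqrt{2\pi}} $$

with $a = (1 - \sqrt{\theta}) \Lambda(\theta^n) / \sqrt{\theta^n - \theta^{n+1}}$, so that
$P[A_n] \ge K n^{-\alpha}$ for large $n$ with some constants $K$ and $\alpha < 1$.
Therefore $\sum_n P[A_n] = \infty$ and by the second Borel-Cantelli lemma,

$$ B_{\theta^n} \ge (1 - \sqrt{\theta}) \Lambda(\theta^n) + B_{\theta^{n+1}} \qquad (4.1) $$

for infinitely many $n$. Since $-B$ is also Brownian motion, we know from (i)
that

$$ -B_{\theta^{n+1}} < 2 \Lambda(\theta^{n+1}) \qquad (4.2) $$

for sufficiently large $n$. Using these two inequalities (4.1) and (4.2) and
$\Lambda(\theta^{n+1}) \le 2 \sqrt{\theta} \Lambda(\theta^n)$ for large enough $n$, we get

$$ B_{\theta^n} > (1 - \sqrt{\theta}) \Lambda(\theta^n) - 2 \Lambda(\theta^{n+1}) \ge \Lambda(\theta^n) ( 1 - \sqrt{\theta} - 4 \sqrt{\theta} ) $$

for infinitely many $n$ and therefore

$$ \limsup_{t \to 0} \frac{B_t}{\Lambda(t)} \ge \limsup_{n \to \infty} \frac{B_{\theta^n}}{\Lambda(\theta^n)} \ge 1 - 5 \sqrt{\theta} . $$

The claim follows for $\theta \to 0$. □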
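
Step (ii) of the proof uses the estimate $\Lambda(\theta^{n+1}) \le 2 \sqrt{\theta} \Lambda(\theta^n)$ for large $n$. As a hedged illustration (the helper name `lil_scale` and the value $\theta = 1/2$ are arbitrary choices), the ratio can be tabulated directly; it tends to $\sqrt{\theta}$.

```python
import math


def lil_scale(t: float) -> float:
    """Lambda(t) = sqrt(2 t log|log t|); needs |log t| > 1, e.g. 0 < t < 1/e."""
    return math.sqrt(2.0 * t * math.log(abs(math.log(t))))


theta = 0.5
# Ratios Lambda(theta^(n+1)) / Lambda(theta^n) along the geometric sequence.
ratios = [lil_scale(theta ** (n + 1)) / lil_scale(theta ** n) for n in range(3, 40)]

for n, ratio in zip(range(3, 40), ratios):
    print(n, ratio, 2.0 * math.sqrt(theta))
```

Since $\Lambda(\theta^{n+1})^2 / \Lambda(\theta^n)^2 = \theta \log((n+1)|\log\theta|) / \log(n|\log\theta|)$, the ratio decreases to $\sqrt{\theta}$, which is why the constant $2\sqrt{\theta}$ leaves room to spare.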

Remark. This statement shows also that $B_t$ changes sign infinitely often
for $t \to 0$ and that Brownian motion is recurrent in one dimension. One
could show more, namely that the set $\{ t \mid B_t = 0 \}$ is a nonempty perfect set
with Hausdorff dimension 1/2 which is in particular uncountable.

By time inversion, one gets the law of iterated logarithm near infinity: 



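Corollary 4.8.2.

$$ P[ \limsup_{t \to \infty} \frac{B_t}{\Lambda(t)} = 1 ] = 1, \qquad P[ \liminf_{t \to \infty} \frac{B_t}{\Lambda(t)} = -1 ] = 1 . $$



Proof. Since $\tilde{B}_t = t B_{1/t}$ (with $\tilde{B}_0 = 0$) is a Brownian motion, we have with
$s = 1/t$

$$ 1 = \limsup_{s \to 0} \frac{\tilde{B}_s}{\Lambda(s)} = \limsup_{s \to 0} \frac{s B_{1/s}}{\Lambda(s)} = \limsup_{t \to \infty} \frac{B_t}{t \Lambda(1/t)} = \limsup_{t \to \infty} \frac{B_t}{\Lambda(t)} , $$

where the last step uses $t \Lambda(1/t) = \Lambda(t)$.
The other statement follows again by reflection. □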

Corollary 4.8.3. For $d$-dimensional Brownian motion, one has

$$ P[ \limsup_{t \to 0} \frac{|B_t|}{\Lambda(t)} = 1 ] = 1 , $$

and for every unit vector $e$

$$ P[ \liminf_{t \to 0} \frac{B_t \cdot e}{\Lambda(t)} = -1 ] = 1 . $$



Proof. Let $e$ be a unit vector in $\mathbb{R}^d$. Then $B_t \cdot e$ is a 1-dimensional
Brownian motion since $B_t$ was defined as the product of $d$ orthogonal Brownian
motions. From the previous theorem, we have

$$ P[ \limsup_{t \to 0} \frac{B_t \cdot e}{\Lambda(t)} = 1 ] = 1 . $$

Since $B_t \cdot e \le |B_t|$, we know that the limsup of $|B_t| / \Lambda(t)$ is $\ge 1$. This is
true for all unit vectors and we can even get it simultaneously for a dense set
$\{ e_n \}_{n \in \mathbb{N}}$ of unit vectors in the unit sphere. Assume the limsup is
$1 + \epsilon > 1$. Then, there exists $e_n$ such that

$$ P[ \limsup_{t \to 0} \frac{B_t \cdot e_n}{\Lambda(t)} \ge 1 + \frac{\epsilon}{2} ] = 1 $$

in contradiction to the law of iterated logarithm for Brownian motion.
Therefore, we have limsup $= 1$. By reflection symmetry, the liminf of
$B_t \cdot e / \Lambda(t)$ is $-1$. □

Remark. It follows that in $d$ dimensions, the set of limit points of $B_t / \Lambda(t)$
for $t \to 0$ is the entire unit ball $\{ |v| \le 1 \}$.

4.9 The theorem of Dynkin-Hunt 

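Definition. Denote by $I(k, n)$ the interval $[\frac{k-1}{2^n}, \frac{k}{2^n})$. If $T$ is a stopping time,
then $T^{(n)}$ denotes its discretisation

$$ T^{(n)}(\omega) = \sum_{k=1}^{\infty} 1_{I(k,n)}(T(\omega)) \frac{k}{2^n} $$

which is again a stopping time. Define also:

$$ \mathcal{A}_{T+} = \{ A \in \mathcal{A}_{\infty} \mid A \cap \{ T < t \} \in \mathcal{A}_t, \; \forall t \} . $$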

The next theorem tells that Brownian motion starts afresh at stopping 
times. 



Theorem 4.9.1 (Dynkin-Hunt). Let $T$ be a stopping time for Brownian
motion, then $\tilde{B}_t = B_{t+T} - B_T$ is Brownian motion when conditioned to
$\{ T < \infty \}$ and $\tilde{B}_t$ is independent of $\mathcal{A}_{T+}$ when conditioned to $\{ T < \infty \}$.



Proof. Let $A$ be the set $\{ T < \infty \}$. The theorem says that for every function

$$ f(\tilde{B}_t) = g( \tilde{B}_{t+t_1}, \tilde{B}_{t+t_2}, \dots, \tilde{B}_{t+t_n} ) $$

with $g \in C(\mathbb{R}^n)$

$$ E[ f(\tilde{B}_t) 1_A ] = E[ f(B_t) ] \cdot P[A] $$

and that for every set $C \in \mathcal{A}_{T+}$

$$ E[ f(\tilde{B}_t) 1_{A \cap C} ] \cdot P[A] = E[ f(\tilde{B}_t) 1_A ] \cdot P[A \cap C] . $$

These two statements are equivalent to the statement that for every $C \in \mathcal{A}_{T+}$

$$ E[ f(\tilde{B}_t) 1_{A \cap C} ] = E[ f(B_t) ] \cdot P[A \cap C] . $$

Let $T^{(n)}$ be the discretisation of the stopping time $T$ and $A_n = \{ T^{(n)} < \infty \}$
as well as $A_{n,k} = \{ T^{(n)} = k/2^n \}$. Using $A = \{ T < \infty \}$ and
$P[ \bigcup_{k=1}^{\infty} A_{n,k} \cap C ] \to P[A \cap C]$ for $n \to \infty$, we compute

$$ E[ f(\tilde{B}_t) 1_{A \cap C} ] = \lim_{n \to \infty} E[ f(B_{T^{(n)}}) 1_{A_n \cap C} ] $$

$$ = \lim_{n \to \infty} \sum_{k=1}^{\infty} E[ f(B_{k/2^n}) 1_{A_{n,k} \cap C} ] $$

$$ = \lim_{n \to \infty} \sum_{k=1}^{\infty} E[ f(B_0) ] \cdot P[ A_{n,k} \cap C ] $$

$$ = E[ f(B_0) ] \lim_{n \to \infty} P[ \bigcup_{k=1}^{\infty} A_{n,k} \cap C ] $$

$$ = E[ f(B_0) ] \cdot P[ A \cap C ] $$

$$ = E[ f(B_t) ] \cdot P[ A \cap C ] . $$

□

Remark. If $T < \infty$ almost everywhere, no conditioning is necessary and
$B_{t+T} - B_T$ is again Brownian motion.



Theorem 4.9.2 (Blumenthal's zero-one law). For every set $A \in \mathcal{A}_{0+}$ we have
$P[A] = 0$ or $P[A] = 1$.



Proof. Take the stopping time $T$ which is identically 0. Now $\tilde{B}_t = B_{t+T} - B_T = B_t$.
By Dynkin-Hunt's result, we know that $\tilde{B} = B$ is independent of
$\mathcal{A}_{T+} = \mathcal{A}_{0+}$. Since every $C \in \mathcal{A}_{0+}$ is $\{ B_s, s > 0 \}$ measurable, we know that
$\mathcal{A}_{0+}$ is independent of itself. □

Remark. This zero-one law can be used to define regular points on the
boundary of a domain $D \subset \mathbb{R}^d$. Given a point $y \in \delta D$. We say it is regular,
if $P_y[ T_{\delta D} > 0 ] = 0$ and irregular if $P_y[ T_{\delta D} > 0 ] = 1$. This definition turns
out to be equivalent to the classical definition in potential theory: a point
$y \in \delta D$ is regular if and only if there exists a barrier function $f : N \to \mathbb{R}$
in a neighborhood $N$ of $y$. A barrier function is defined as a negative
subharmonic function on $\mathrm{int}(N \cap D)$ satisfying $f(x) \to 0$ for $x \to y$ within
$D$.



4.10 Self-intersection of Brownian motion 



Our aim is to prove the following theorem:

Theorem 4.10.1 (Self intersections of random walk). For $d \le 3$, Brownian
motion has infinitely many self intersections with probability 1.



Remark. Kakutani, Dvoretsky and Erdős have shown that for $d > 3$, there
are no self-intersections with probability 1. It is known that for $d \le 2$, there
are infinitely many $n$-fold points and for $d \ge 3$, there are no triple points.



Proposition 4.10.2. Let $K$ be a compact subset of $\mathbb{R}^d$ and $T$ the hitting time
of $K$ with respect to Brownian motion starting at $y$. The hitting probability
$h(y) = P[ y + B_s \in K, T \le s < \infty ]$ is a harmonic function on $\mathbb{R}^d \setminus K$.



Proof. Let $T_\delta$ be the hitting time of $S_\delta = \{ |x - y| = \delta \}$. By the law of
iterated logarithm, we have $T_\delta < \infty$ almost everywhere. By Dynkin-Hunt,
we know that $\tilde{B}_t = B_{t+T_\delta} - B_{T_\delta}$ is again Brownian motion.

If $\delta$ is small enough, then $y + B_s \notin K$ for $s \le T_\delta$. The random variable
$B_{T_\delta} \in S_\delta$ has a uniform distribution on $S_\delta$ because Brownian motion is
rotationally symmetric. We have therefore

$$ h(y) = P[ y + B_s \in K, s \ge T_\delta ] $$

$$ = P[ y + B_{T_\delta} + \tilde{B}_s \in K ] $$

$$ = \int_{S_\delta} h(y + x) \, d\mu(x) , $$

where $\mu$ is the normalized Lebesgue measure on $S_\delta$. This equality for small
enough $\delta$ is the definition of harmonicity. □



Proposition 4.10.3. Let $K$ be a countable union of closed balls. Then
$h(y, K) \to 1$ for $y \to K$.



Proof. (i) We show the claim first for one ball $K = B_r(z)$ and let $R = |z - y|$.
By Brownian scaling $B_t \sim c \cdot B_{t/c^2}$. The hitting probability of $K$ can only
be a function $f(r/R)$ of $r/R$:

$$ h(y, K) = P[ y + B_s \in K, T \le s ] = P[ cy + B_{s/c^2} \in cK, T_K \le s ] $$

$$ = P[ cy + B_{s/c^2} \in cK, T_{cK} \le s/c^2 ] $$

$$ = P[ cy + B_s \in cK, T_{cK} \le s ] $$

$$ = h(cy, cK) . $$

We have to show therefore that $f(x) \to 1$ as $x \to 1$. By translation
invariance, we can fix $y = y_0 = (1, 0, \dots, 0)$ and change $K_\alpha$, which is a ball of
radius $\alpha$ around $(-\alpha, 0, \dots)$. We have

$$ h(y_0, K_\alpha) = f( \alpha / (1 + \alpha) ) $$

and take therefore the limit $\alpha \to \infty$

$$ \lim_{x \to 1} f(x) = \lim_{\alpha \to \infty} h(y_0, K_\alpha) = h( y_0, \bigcup_\alpha K_\alpha ) $$

$$ = P[ \inf_{s \ge 0} (B_s)_1 < -1 ] = 1 $$

because of the law of iterated logarithm.

(ii) Given $y_n \to y_0 \in K$. Then $y_0 \in K_0$ for some ball $K_0$.

$$ \liminf_{n \to \infty} h(y_n, K) \ge \lim_{n \to \infty} h(y_n, K_0) = 1 $$

by (i). □

Definition. Let $\mu$ be a probability measure on $\mathbb{R}^3$. Define the potential
theoretical energy of $\mu$ as

$$ I(\mu) = \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} |x - y|^{-1} \, d\mu(x) \, d\mu(y) . $$



Given a compact set $K \subset \mathbb{R}^3$, the capacity of $K$ is defined as

$$ C(K) = ( \inf_{\mu \in M(K)} I(\mu) )^{-1} , $$

where $M(K)$ is the set of probability measures on $K$. A measure on $K$
minimizing the energy is called an equilibrium measure.

Remark. These definitions can be done in any dimension. In the case $d = 2$,
one replaces $|x - y|^{-1}$ by $\log |x - y|^{-1}$. In the case $d \ge 3$, one takes
$|x - y|^{-(d-2)}$. The capacity is for $d = 2$ defined as $\exp( -\inf_\mu I(\mu) )$ and for
$d \ge 3$ as $( \inf_\mu I(\mu) )^{-(d-2)}$.

Definition. We say a measure $\mu_n$ on $\mathbb{R}^d$ converges weakly to $\mu$, if for all
continuous functions $f$, $\int f \, d\mu_n \to \int f \, d\mu$. The set of all probability measures
on a compact subset $E$ of $\mathbb{R}^d$ is known to be compact.

The next proposition is part of Frostman's fundamental theorem of poten- 
tial theory. For detailed proofs, we refer to [40, 82]. 



Proposition 4.10.4. For every compact set $K \subset \mathbb{R}^d$, there exists an
equilibrium measure $\mu$ on $K$ and the equilibrium potential $\int |x - y|^{-(d-2)} \, d\mu(y)$
rsp. $\int \log( |x - y|^{-1} ) \, d\mu(y)$ takes the value $C(K)^{-1}$ on the support $K^*$ of
$\mu$.

Proof. (i) (Lower semicontinuity of energy) If $\mu_n$ converges to $\mu$, then

$$ \liminf_{n \to \infty} I(\mu_n) \ge I(\mu) . $$

(ii) (Existence of equilibrium measure) The existence of an equilibrium
measure $\mu$ follows from the compactness of the set of probability measures on
$K$ and the lower semicontinuity of the energy since a lower semi-continuous
function takes a minimum on a compact space. Take a sequence $\mu_n$ such
that

$$ I(\mu_n) \to \inf_{\mu \in M(K)} I(\mu) . $$

Then $\mu_n$ has an accumulation point $\mu$ and $I(\mu) \le \inf_{\mu \in M(K)} I(\mu)$.

(iii) (Value of capacity) If the potential $\phi(x)$ belonging to $\mu$ is constant on
$K$, then it must take the value $C(K)^{-1}$ since

$$ \int \phi(x) \, d\mu(x) = I(\mu) . $$

(iv) (Constancy of capacity) Assume the potential is not constant $C(K)^{-1}$
on $K^*$. By constructing a new measure on $K^*$ one shows then that one can
strictly decrease the energy. This is physically evident if we think of $\phi$ as
the potential of a charge distribution $\mu$ on the set $K$. □



Corollary 4.10.5. Let $\mu$ be the equilibrium distribution on $K$. Then

$$ h(y, K) = \phi_\mu(y) \cdot C(K) $$

and therefore $h(y, K) \ge C(K) \cdot \inf_{x \in K} |x - y|^{-1}$.



Proof. Assume first $K$ is a countable union of balls. According to
proposition (4.10.2) and proposition (4.10.3), both functions $h$ and $\phi_\mu \cdot C(K)$ are
harmonic, zero at $\infty$ and equal to 1 on $\delta(K)$. They must therefore be equal.
For a general compact set $K$, let $\{ y_n \}$ be a dense set in $K$ and let
$K_\epsilon = \bigcup_n B_\epsilon(y_n)$. One can pass to the limit $\epsilon \to 0$. Both $h(y, K_\epsilon) \to h(y, K)$ and
$\inf_{x \in K_\epsilon} |x - y|^{-1} \to \inf_{x \in K} |x - y|^{-1}$ are clear. The statement
$C(K_\epsilon) \to C(K)$ follows from the upper semicontinuity of the capacity: if $G_n$ is a
sequence of open sets with $\bigcap G_n = E$, then $C(G_n) \to C(E)$.
The upper semicontinuity of the capacity follows from the lower
semicontinuity of the energy. □

Proposition 4.10.6. Assume, the dimension $d = 3$. For any interval $J = [a, b]$,
the set

$$ B_J(\omega) = \{ B_t(\omega) \mid t \in [a, b] \} $$

has positive capacity for almost all $\omega$.



Proof. We have to find a probability measure $\mu(\omega)$ on $B_J(\omega)$ such that its
energy $I(\mu(\omega))$ is finite almost everywhere. Define such a measure by

$$ \mu(A) = \frac{ | \{ s \in [a, b] \mid B_s \in A \} | }{ b - a } . $$

Then

$$ I(\mu) = \int \int |x - y|^{-1} \, d\mu(x) \, d\mu(y) = \int_a^b \int_a^b (b - a)^{-2} |B_s - B_t|^{-1} \, ds \, dt . $$

To see the claim, we have to show that this is finite almost everywhere. We
integrate over $\Omega$, which gives by Fubini

$$ E[ I(\mu) ] = \int_a^b \int_a^b (b - a)^{-2} E[ |B_s - B_t|^{-1} ] \, ds \, dt $$

which is finite since $B_s - B_t$ has the same distribution as $\sqrt{|s - t|} B_1$ by
Brownian scaling and since $E[ |B_1|^{-1} ] < \infty$ in dimension $d \ge 2$ and
$\int_a^b \int_a^b |s - t|^{-1/2} \, ds \, dt < \infty$. □

Now we prove the theorem.

Proof. We have only to show the claim in the case $d = 3$. Because Brownian
motion projected to the plane is two dimensional Brownian motion and
projected to the line is one dimensional Brownian motion, the result in
smaller dimensions follows.

(i) $\alpha = P[ \bigcup_{t \in [0,1], s \ge 2} \{ B_t = B_s \} ] > 0$.

Proof. Let $K$ be the set $\bigcup_{t \in [0,1]} B_t$. We know that it has positive capacity
almost everywhere and that therefore $h(B_s, K) > 0$ almost everywhere.
But $h(B_s, K) = \alpha$ since $B_{s+2} - B_s$ is Brownian motion independent of
$B_s$, $0 \le s \le 1$.

(ii) $\alpha_T = P[ \bigcup_{t \in [0,1], 2 \le s \le T} \{ B_t = B_s \} ] > 0$ for some $T > 0$. Proof. Clear since
$\alpha_T \to \alpha$ for $T \to \infty$.

(iii) Proof of the claim. Define the random variables $X_n = 1_{C_n}$ with

$$ C_n = \{ \omega \mid B_t = B_s \text{ for some } t \in [nT, nT + 1], \; s \in [nT + 2, (n + 1)T] \} . $$

They are independent and by the strong law of large numbers $\sum_n X_n = \infty$
almost everywhere. □

Corollary 4.10.7. Any point $B_s(\omega)$ is an accumulation point of self-crossings
of $\{ B_t(\omega) \}_{t \ge 0}$.



Proof. Again, we have only to treat the three dimensional case. Let $T > 0$
be such that

$$ \alpha_T = P[ \bigcup_{t \in [0,1], 2 \le s \le T} \{ B_t = B_s \} ] > 0 $$

as in the proof of the theorem. By scaling,

$$ P[ B_t = B_s \mid t \in [0, \beta], \; s \in [2\beta, T\beta] ] $$

is independent of $\beta$. We have thus self-intersections of the random walk in
any interval $[0, b]$ and by translation in any interval $[a, b]$. □

4.11 Recurrence of Brownian motion 

We show in this section that like its discrete brother, the random walk,
Brownian motion is transient in dimensions $d \ge 3$ and recurrent in
dimensions $d \le 2$.



Lemma 4.11.1. Let $T$ be a finite stopping time and $R_T(\omega)$ be a rotation in
$\mathbb{R}^d$ which turns $B_T(\omega)$ onto the first coordinate axis

$$ R_T(\omega) B_T(\omega) = ( |B_T(\omega)|, 0, \dots, 0 ) . $$

Then $\tilde{B}_t = R_T( B_{t+T} - B_T )$ is again Brownian motion.



Proof. By the Dynkin-Hunt theorem, $\tilde{B}_t = B_{t+T} - B_T$ is Brownian motion
and independent of $\mathcal{A}_T$. By checking the definitions of Brownian motion,
it follows that if $B$ is Brownian motion, also $R(x) B_t$ is Brownian motion,
if $R(x)$ is a random rotation on $\mathbb{R}^d$ independent of $B_t$. Since $R_T$ is $\mathcal{A}_T$
measurable and $\tilde{B}_t$ is independent of $\mathcal{A}_T$, the claim follows. □



Lemma 4.11.2. Let $K_r$ be the ball of radius $r$ centered at $0 \in \mathbb{R}^d$ with
$d \ge 3$. We have for $y \notin K_r$

$$ h(y, K_r) = ( r / |y| )^{d-2} . $$

Proof. Both $h(y, K_r)$ and $( r / |y| )^{d-2}$ are harmonic functions which are 1 at
$\delta K_r$ and zero at infinity. They are the same. □
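
The harmonicity of $( r / |y| )^{d-2}$ used in this proof can be checked numerically. The following sketch approximates the Laplacian by central second differences; the test points, the step size `h`, and the function names are illustrative assumptions.

```python
import math


def hitting_probability(y, r=1.0):
    """h(y, K_r) = (r / |y|)^(d - 2) for |y| >= r, with d = len(y) >= 3."""
    d = len(y)
    norm = math.sqrt(sum(c * c for c in y))
    return (r / norm) ** (d - 2)


def discrete_laplacian(f, y, h=1e-3):
    """Central second-difference approximation of the Laplacian of f at y."""
    total = 0.0
    for i in range(len(y)):
        up = list(y)
        down = list(y)
        up[i] += h
        down[i] -= h
        total += (f(up) + f(down) - 2.0 * f(y)) / (h * h)
    return total


point3 = [1.5, -2.0, 0.7]        # a point outside the unit ball in R^3
point4 = [1.5, -2.0, 0.7, 1.1]   # a point outside the unit ball in R^4
lap3 = discrete_laplacian(hitting_probability, point3)
lap4 = discrete_laplacian(hitting_probability, point4)
print(lap3, lap4)
```

Both values vanish up to discretization and rounding error, while the boundary value on $|y| = r$ is 1 and the function decays to zero at infinity.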



Theorem 4.11.3 (Escape of Brownian motion in three dimensions). For
$d \ge 3$, we have $\lim_{t \to \infty} |B_t| = \infty$ almost surely.



Proof. Define a sequence of stopping times $T_n$ by

$$ T_n = \inf \{ s > 0 \mid |B_s| = 2^n \} , $$

which is finite almost everywhere because of the law of iterated logarithm.
We know from the lemma (4.11.1) that

$$ \tilde{B}_t = R_{T_n}( B_{t+T_n} - B_{T_n} ) $$

is a copy of Brownian motion. Clearly also $|B_{T_n}| = 2^n$.

We have $B_s \in K_r(0) = \{ |x| < r \}$ for some $s > T_n$ if and only if
$\tilde{B}_t \in (2^n, 0, \dots, 0) + K_r(0)$ for some $t > 0$.

Therefore using the previous lemma

$$ P[ B_s \in K_r(0); \; s > T_n ] = P[ \tilde{B}_t \in (2^n, 0, \dots, 0) + K_r(0); \; t > 0 ] = ( \frac{r}{2^n} )^{d-2} $$

which implies in the case $r 2^{-n} < 1$ by the Borel-Cantelli lemma that for
almost all $\omega$, $|B_s(\omega)| \ge r$ for $s > T_n$ and all sufficiently large $n$. Since $T_n$ is
finite almost everywhere, we get $\liminf_{s \to \infty} |B_s| \ge r$. Since $r$ is arbitrary, the
claim follows. □

Brownian motion is recurrent in dimensions $d \le 2$. In the case $d = 1$, this
follows readily from the law of iterated logarithm. First a lemma.



Lemma 4.11.4. In dimension $d = 2$, almost every path of Brownian motion
hits a ball $K_r$ if $r > 0$: one has $h(y, K_r) = 1$.



Proof. We know that $h(y) = h(y, K_r)$ is harmonic and equal to 1 on $\delta K_r$. It
is also rotationally invariant and therefore $h(y) = a + b \log |y|$. Since $h \in [0, 1]$
we have $h(y) = a$ and so $a = 1$. □



Theorem 4.11.5 (Recurrence of Brownian motion in 1 or 2 dimensions). Let
$d \le 2$ and $S$ be an open nonempty set in $\mathbb{R}^d$. Then the Lebesgue measure
of $\{ t \mid B_t \in S \}$ is infinite.

Proof. It suffices to take $S = K_r(x_0)$, a ball of radius $r$ around $x_0$. Since
by the previous lemma, Brownian motion hits every ball almost surely, we
can assume that $x_0 = 0$ and by scaling that $r = 1$.

Define inductively a sequence of hitting or leaving times $T_n, S_n$ of the
annulus $\{ 1/2 < |x| < 2 \}$, where $T_1 = \inf \{ t \mid |B_t| = 2 \}$ and

$$ S_n = \inf \{ t > T_n \mid |B_t| = 1/2 \} $$

$$ T_n = \inf \{ t > S_{n-1} \mid |B_t| = 2 \} . $$

These are finite stopping times. The Dynkin-Hunt theorem shows that
$S_n - T_n$ and $T_n - S_{n-1}$ are two mutually independent families of IID random
variables. The Lebesgue measures $Y_n = |I_n|$ of the time intervals

$$ I_n = \{ t \mid |B_t| \le 1, \; T_n \le t \le T_{n+1} \} $$

are independent random variables. Therefore, also $X_n = \min(1, Y_n)$ are
independent bounded IID random variables. By the law of large numbers,
$\sum_n X_n = \infty$ which implies $\sum_n Y_n = \infty$ and the claim follows from

$$ | \{ t \in [0, \infty) \mid |B_t| \le 1 \} | \ge \sum_n Y_n . $$

□

Remark. Brownian motion in $\mathbb{R}^d$ can be defined as a diffusion on $\mathbb{R}^d$ with
generator $\Delta/2$, where $\Delta$ is the Laplacian on $\mathbb{R}^d$. A generalization of
Brownian motion to manifolds can be done using the diffusion processes with
respect to the Laplace-Beltrami operator. Like this, one can define
Brownian motion on the torus or on the sphere for example. See [59].

4.12 Feynman-Kac formula 

In quantum mechanics, the Schrödinger equation $i \hbar \dot{u} = H u$ defines the
evolution of the wave function $u(t) = e^{-itH/\hbar} u(0)$ in a Hilbert space $\mathcal{H}$. The
operator $H$ is the Hamiltonian of the system. We assume, it is a Schrödinger
operator $H = H_0 + V$, where $H_0 = -\Delta/2$ is the Hamiltonian of a free
particle and $V : \mathbb{R}^d \to \mathbb{R}$ is the potential. The free operator $H_0$ already is
not defined on the whole Hilbert space $\mathcal{H} = L^2(\mathbb{R}^d)$ and one restricts $H$ to
a vector space $D(H)$ called domain, which contains the set $C_0^\infty(\mathbb{R}^d)$
of all smooth functions which are zero at infinity, a dense set in $\mathcal{H}$. Define

$$ D(A^*) = \{ u \in \mathcal{H} \mid v \mapsto (Av, u) \text{ is a bounded linear functional on } D(A) \} . $$

If $u \in D(A^*)$, then there exists a unique function $w = A^* u \in \mathcal{H}$ such that
$(Av, u) = (v, w)$ for all $v \in D(A)$. This defines the adjoint $A^*$ of $A$ with
domain $D(A^*)$.

Definition. A linear operator $A : D(A) \subset \mathcal{H} \to \mathcal{H}$ is called symmetric if
$(Au, v) = (u, Av)$ for all $u, v \in D(A)$ and self-adjoint, if it is symmetric and
$D(A) = D(A^*)$.




Definition. A sequence of bounded linear operators $A_n$ converges strongly
to $A$, if $A_n u \to A u$ for all $u \in \mathcal{H}$. One writes $A = \text{s-}\lim_{n \to \infty} A_n$.

Define $e^A = 1 + A + A^2/2! + A^3/3! + \cdots$. We will use the fact that a
self-adjoint operator defines a one parameter family of unitary operators
$t \mapsto e^{itA}$ which is strongly continuous. Moreover, $e^{itA}$ leaves the domain
$D(A)$ of $A$ invariant. For more details, see [83, 7].



Theorem 4.12.1 (Trotter product formula). Given self-adjoint operators
$A, B$ defined on $D(A), D(B) \subset \mathcal{H}$. Assume $A + B$ is self-adjoint on
$D = D(A) \cap D(B)$, then

$$ e^{it(A+B)} = \text{s-}\lim_{n \to \infty} ( e^{itA/n} e^{itB/n} )^n . $$

If $A, B$ are bounded from below, then

$$ e^{-t(A+B)} = \text{s-}\lim_{n \to \infty} ( e^{-tA/n} e^{-tB/n} )^n . $$



Proof. Define

$$ S_t = e^{it(A+B)}, \quad V_t = e^{itA}, \quad W_t = e^{itB}, \quad U_t = V_t W_t $$

and $v_t = S_t v$ for $v \in D$. Because $A + B$ is self-adjoint on $D$, one has $v_t \in D$.
Use a telescopic sum to estimate

$$ \| ( S_t - U_{t/n}^n ) v \| = \| \sum_{j=0}^{n-1} U_{t/n}^j ( S_{t/n} - U_{t/n} ) S_{t/n}^{n-j-1} v \| \le n \sup_{0 \le s \le t} \| ( S_{t/n} - U_{t/n} ) v_s \| . $$

We have to show that this goes to zero for $n \to \infty$. Given $u \in D = D(A) \cap D(B)$,

$$ \lim_{s \to 0} \frac{S_s - 1}{s} u = i (A + B) u = \lim_{s \to 0} \frac{U_s - 1}{s} u $$

so that for each $u \in D$

$$ \lim_{n \to \infty} n \cdot \| ( S_{t/n} - U_{t/n} ) u \| = 0 . \qquad (4.3) $$

The linear space $D$ with norm $||| u ||| = \| (A + B) u \| + \| u \|$ is a Banach
space since $A + B$ is self-adjoint on $D$ and therefore closed. We have a
bounded family $\{ n ( S_{t/n} - U_{t/n} ) \}_{n \in \mathbb{N}}$ of bounded operators from $D$ to $\mathcal{H}$.
The principle of uniform boundedness states that

$$ \| n ( S_{t/n} - U_{t/n} ) u \| \le C \cdot ||| u ||| . $$

An $\epsilon/3$ argument shows that the limit (4.3) exists uniformly on compact
subsets of $D$ and especially on $\{ v_s \}_{s \in [0,t]} \subset D$ and so
$n \sup_{0 \le s \le t} \| ( S_{t/n} - U_{t/n} ) v_s \| \to 0$. The second statement is proved in exactly
the same way. □

Remark. Trotter's product formula generalizes the Lie product formula

$$ \lim_{n \to \infty} ( \exp(\frac{A}{n}) \exp(\frac{B}{n}) )^n = \exp(A + B) $$

for finite dimensional matrices $A, B$, which is a special case.
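
The Lie product formula can be observed on a pair of non-commuting $2 \times 2$ matrices. This is a minimal sketch: the matrices `A` and `B`, the truncated power series used for the exponential, and all function names are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def expm(a, terms=30):
    """Matrix exponential via the truncated series 1 + A + A^2/2! + ..."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        term = scale(matmul(term, a), 1.0 / k)
        result = add(result, term)
    return result


def lie_product(a, b, n):
    """(exp(A/n) exp(B/n))^n computed by repeated multiplication."""
    step = matmul(expm(scale(a, 1.0 / n)), expm(scale(b, 1.0 / n)))
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = matmul(result, step)
    return result


def max_error(x, y):
    return max(abs(x[i][j] - y[i][j]) for i in range(2) for j in range(2))


A = [[0.0, 1.0], [0.0, 0.0]]   # A and B do not commute
B = [[0.0, 0.0], [1.0, 0.0]]
target = expm(add(A, B))       # equals [[cosh 1, sinh 1], [sinh 1, cosh 1]]
errors = {n: max_error(lie_product(A, B, n), target) for n in (1, 10, 100, 1000)}
for n, err in errors.items():
    print(n, err)
```

The error shrinks roughly like $1/n$, which matches the first order nature of the splitting $e^{A/n} e^{B/n}$.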



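Corollary 4.12.2. (Feynman 1948) Assume $H = H_0 + V$ is self-adjoint on
$D(H)$. Then

$$ e^{-itH} u(x_0) = \lim_{n \to \infty} ( \frac{2 \pi i t}{n} )^{-dn/2} \int_{(\mathbb{R}^d)^n} e^{i S_n(x_0, x_1, \dots, x_n, t)} u(x_n) \, dx_1 \dots dx_n $$

where

$$ S_n(x_0, x_1, \dots, x_n, t) = \frac{t}{n} \sum_{i=1}^{n} [ \frac{1}{2} ( \frac{|x_i - x_{i-1}|}{t/n} )^2 - V(x_i) ] . $$



Proof. (Nelson) From $\dot{u} = -i H_0 u$, we get by Fourier transform
$\dot{\hat{u}} = -i \frac{|k|^2}{2} \hat{u}$ which gives $\hat{u}_t(k) = \exp( -i \frac{|k|^2}{2} t ) \hat{u}_0(k)$ and by inverse
Fourier transform

$$ e^{-itH_0} u(x) = u_t(x) = (2 \pi i t)^{-d/2} \int_{\mathbb{R}^d} e^{i \frac{|x - y|^2}{2t}} u(y) \, dy . $$

The Trotter product formula

$$ e^{-it(H_0 + V)} = \text{s-}\lim_{n \to \infty} ( e^{-itH_0/n} e^{-itV/n} )^n $$

gives now the claim. □

Remark. We did not specify the set of potentials, for which $H_0 + V$ can be
made self-adjoint. For example, $V \in C_0^\infty(\mathbb{R}^d)$ is enough or $V \in L^2(\mathbb{R}^3) \cap L^\infty(\mathbb{R}^3)$
in three dimensions.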

We have seen in the above proof that $e^{-itH_0}$ has the integral kernel

$$ \tilde{P}_t(x, y) = (2 \pi i t)^{-d/2} e^{i \frac{|x - y|^2}{2t}} . $$

The same Fourier calculation shows that $e^{-tH_0}$ has the integral kernel

$$ P_t(x, y) = (2 \pi t)^{-d/2} e^{-\frac{|x - y|^2}{2t}} = g_t(x - y) , $$

where $g_t$ is the density of a Gaussian random variable with variance $t$.
Note that even if $u \in L^2(\mathbb{R}^d)$ is only defined almost everywhere, the
function $u_t(x) = e^{-tH_0} u(x) = \int P_t(x, y) u(y) \, dy$ is continuous and defined
everywhere.

Lemma 4.12.3. Given $f_1, \dots, f_n \in L^\infty(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and $0 < s_1 < \cdots < s_n$.
Then

$$ ( e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n )(0) = \int f_1(B_{s_1}) \cdots f_n(B_{s_n}) \, dB , $$

where $t_1 = s_1$, $t_i = s_i - s_{i-1}$, $i \ge 2$ and the $f_i$ on the left hand side are
understood as multiplication operators on $L^2(\mathbb{R}^d)$.



Proof. Since $B_{s_1}, B_{s_2} - B_{s_1}, \dots, B_{s_n} - B_{s_{n-1}}$ are mutually independent
Gaussian random variables of variance $t_1, t_2, \dots, t_n$, their joint
distribution is

$$ P_{t_1}(0, y_1) P_{t_2}(0, y_2) \dots P_{t_n}(0, y_n) \, dy $$

which is after a change of variables $y_1 = x_1$, $y_i = x_i - x_{i-1}$

$$ P_{t_1}(0, x_1) P_{t_2}(x_1, x_2) \dots P_{t_n}(x_{n-1}, x_n) \, dx . $$

Therefore,

$$ \int f_1(B_{s_1}) \cdots f_n(B_{s_n}) \, dB $$

$$ = \int_{(\mathbb{R}^d)^n} P_{t_1}(0, y_1) P_{t_2}(0, y_2) \dots P_{t_n}(0, y_n) f_1(y_1) f_2(y_1 + y_2) \cdots f_n(y_1 + \cdots + y_n) \, dy $$

$$ = \int_{(\mathbb{R}^d)^n} P_{t_1}(0, x_1) P_{t_2}(x_1, x_2) \dots P_{t_n}(x_{n-1}, x_n) f_1(x_1) \cdots f_n(x_n) \, dx $$

$$ = ( e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n )(0) . $$

□
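
The product structure $P_{t_1}(0, x_1) P_{t_2}(x_1, x_2) \cdots$ in this proof rests on the semigroup property of the Gaussian kernel, $\int P_s(x, z) P_t(z, y) \, dz = P_{s+t}(x, y)$. The sketch below checks it in dimension $d = 1$ with a plain Riemann sum; the grid parameters and function names are illustrative assumptions.

```python
import math


def heat_kernel(t: float, x: float, y: float) -> float:
    """P_t(x, y) = (2 pi t)^(-1/2) exp(-(x - y)^2 / (2 t)) in dimension d = 1."""
    return math.exp(-(x - y) ** 2 / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def compose(s: float, t: float, x: float, y: float,
            half_width: float = 12.0, steps: int = 4801) -> float:
    """Riemann sum for the integral of P_s(x, z) P_t(z, y) over z in
    [-half_width, half_width]; the tails beyond that range are negligible
    for the moderate values of s, t, x, y used here."""
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        z = -half_width + i * h
        total += heat_kernel(s, x, z) * heat_kernel(t, z, y)
    return total * h


lhs = compose(0.3, 0.7, 0.2, -0.5)
rhs = heat_kernel(1.0, 0.2, -0.5)
print(lhs, rhs)
```

Probabilistically, this is the statement that the sum of two independent Gaussian increments with variances $s$ and $t$ is Gaussian with variance $s + t$.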

Denote by $dB$ the Wiener measure on $C([0, \infty), \mathbb{R}^d)$ and with $dx$ the
Lebesgue measure on $\mathbb{R}^d$. We define also an extended Wiener measure
$dW = dx \times dB$ on $C([0, \infty), \mathbb{R}^d)$ on all paths $s \mapsto W_s = x + B_s$ starting at
$x \in \mathbb{R}^d$.



Corollary 4.12.4. Given $f_0, f_1, \dots, f_n \in L^\infty(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and
$0 \le s_0 < s_1 < \cdots < s_n$. Then

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = ( \overline{f_0}, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n ) . $$

Proof. (i) Case $s_0 = 0$. From the above lemma, we have after the $dB$
integration that

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = \int_{\mathbb{R}^d} f_0(x) ( e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n )(x) \, dx = ( \overline{f_0}, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n ) . $$

(ii) In the case $s_0 > 0$ we have from (i) and the dominated convergence
theorem

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = \lim_{R \to \infty} \int 1_{\{ |x| < R \}}(W_0) f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW $$

$$ = \lim_{R \to \infty} ( \overline{f_0} \, e^{-s_0 H_0} 1_{\{ |x| < R \}}, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n ) $$

$$ = ( \overline{f_0}, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n ) . $$

□



We prove now the Feynman-Kac formula for Schrödinger operators of the
form $H = H_0 + V$ with $V \in C_0^\infty(\mathbb{R}^d)$. Because $V$ is continuous, the integral
$\int_0^t V(W_s(\omega)) \, ds$ can be taken for each $\omega$ as a limit of Riemann sums and
$\int_0^t V(W_s) \, ds$ certainly is a random variable.



Theorem 4.12.5 (Feynman-Kac formula). Given $H = H_0 + V$ with
$V \in C_0^\infty(\mathbb{R}^d)$, then

$$ (f, e^{-tH} g) = \int \overline{f}(W_0) g(W_t) e^{-\int_0^t V(W_s) \, ds} \, dW . $$



Proof. (Nelson) By the Trotter product formula

$$ (f, e^{-tH} g) = \lim_{n \to \infty} ( f, ( e^{-tH_0/n} e^{-tV/n} )^n g ) $$

so that by corollary (4.12.4)

$$ (f, e^{-tH} g) = \lim_{n \to \infty} \int \overline{f}(W_0) g(W_t) \exp( -\frac{t}{n} \sum_{j=1}^{n} V(W_{tj/n}) ) \, dW \qquad (4.4) $$

and since $s \mapsto W_s$ is continuous, we have almost everywhere

$$ \frac{t}{n} \sum_{j=1}^{n} V(W_{tj/n}) \to \int_0^t V(W_s) \, ds . $$

The integrand on the right hand side of (4.4) is dominated by

$$ |f(W_0)| \cdot |g(W_t)| \cdot e^{t \| V \|_\infty} $$

which is in $L^1(dW)$ because again by corollary (4.12.4),

$$ \int |f(W_0)| \cdot |g(W_t)| \, dW = ( |f|, e^{-tH_0} |g| ) < \infty . $$

The dominated convergence theorem (2.4.3) leads us now to the claim. □

Remark. The formula can be extended to larger classes of potentials like
potentials $V$ which are locally in $L^1$. The self-adjointness, which is needed in
Trotter's product formula, is assured if $V \in L^2 \cap L^p$ with $p > d/2$. Also
Trotter's product formula allows further generalizations [97, 32].

Why is the Feynman-Kac formula useful? 

• One can use Brownian motion to study Schrodinger semigroups. It al- 
lows for example to give an easy proof of the ArcSin-law for Brownian 
motion. 

• One can treat operators with magnetic fields in a unified way. 

• Functional integration is a way of quantization which generalizes to 
more situations. 

• It is useful to study ground states and ground state energies under 
perturbations. 

• One can study the classical limit $\hbar \to 0$.



4.13 The quantum mechanical oscillator 

The one-dimensional Schrödinger operator

$$ H = H_0 + U = -\frac{1}{2} \frac{d^2}{dx^2} + \frac{1}{2} x^2 - \frac{1}{2} $$

is the Hamiltonian of the quantum mechanical oscillator. It is a quantum
mechanical system which can be solved explicitly like its classical analog,
which has the Hamiltonian $H(x, p) = \frac{1}{2} p^2 + \frac{1}{2} x^2 - \frac{1}{2}$.

One can write

$$ H = A A^* - 1 = A^* A , $$

with

$$ A^* = \frac{1}{\sqrt{2}} ( x - \frac{d}{dx} ), \qquad A = \frac{1}{\sqrt{2}} ( x + \frac{d}{dx} ) . $$

The first order operator $A^*$ is also called particle creation operator and $A$,
the particle annihilation operator. The space $C_0^\infty$ of smooth functions of
compact support is dense in $L^2(\mathbb{R})$. Because for all $u, v \in C_0^\infty(\mathbb{R})$

$$ (Au, v) = (u, A^* v) $$

the two operators are adjoint to each other. The vector

$$ \Omega_0 = \frac{1}{\pi^{1/4}} e^{-x^2/2} $$

is a unit vector because $\Omega_0^2$ is the density of a $N(0, 1/\sqrt{2})$ distributed
random variable. Because $A \Omega_0 = 0$, it is an eigenvector of $H = A^* A$ with
eigenvalue 1/2. It is called the ground state or vacuum state describing the
system with no particle. Define inductively the $n$-particle states

$$ \Omega_n = \frac{1}{\sqrt{n}} A^* \Omega_{n-1} $$

by creating an additional particle from the $(n-1)$-particle state $\Omega_{n-1}$.

Figure. The first Hermite functions $\Omega_n$. They are unit vectors
in $L^2(\mathbb{R})$ defined by

$$ \Omega_n(x) = \frac{ H_n(x) \Omega_0(x) }{ \sqrt{2^n n!} } $$

where $H_n(x)$ are Hermite polynomials, $H_0(x) = 1$, $H_1(x) = 2x$,
$H_2(x) = 4x^2 - 2$, $H_3(x) = 8x^3 - 12x, \dots$.




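Theorem 4.13.1 (Quantum mechanical oscillator). The following properties
hold:

a) The functions are orthonormal: $(\Omega_n, \Omega_m) = \delta_{n,m}$.

b) $A \Omega_n = \sqrt{n} \, \Omega_{n-1}$, $A^* \Omega_n = \sqrt{n+1} \, \Omega_{n+1}$.

c) $(n - \frac{1}{2})$ are the eigenvalues of $H$:

$$ H \Omega_n = ( A^* A - \frac{1}{2} ) \Omega_n = ( n - \frac{1}{2} ) \Omega_n . $$

d) The functions $\Omega_n$ form a basis in $L^2(\mathbb{R})$.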



Proof. Denote by $[A, B] = AB - BA$ the commutator of two operators $A$
and $B$. We check first by induction the formula

$$ [A, (A^*)^n] = n \cdot (A^*)^{n-1} . $$

For $n = 1$, this means $[A, A^*] = 1$. The induction step is

$$ [A, (A^*)^n] = [A, (A^*)^{n-1}] A^* + (A^*)^{n-1} [A, A^*] $$

$$ = (n-1) (A^*)^{n-1} + (A^*)^{n-1} = n (A^*)^{n-1} . $$

a) Also

$$ ( (A^*)^n \Omega_0, (A^*)^m \Omega_0 ) = n! \, \delta_{mn} $$

can be proven by induction. For $n = 0$ it follows from the fact that $\Omega_0$ is
normalized. The induction step uses $[A, (A^*)^n] = n \cdot (A^*)^{n-1}$ and $A \Omega_0 = 0$:

$$ ( (A^*)^n \Omega_0, (A^*)^m \Omega_0 ) = ( A (A^*)^n \Omega_0, (A^*)^{m-1} \Omega_0 ) $$

$$ = ( [A, (A^*)^n] \Omega_0, (A^*)^{m-1} \Omega_0 ) $$

$$ = n ( (A^*)^{n-1} \Omega_0, (A^*)^{m-1} \Omega_0 ) . $$

If $n < m$, then we get from this 0 after $n$ steps, while in the case $n = m$,
we obtain $( (A^*)^n \Omega_0, (A^*)^n \Omega_0 ) = n \cdot ( (A^*)^{n-1} \Omega_0, (A^*)^{n-1} \Omega_0 )$, which is by
induction $n (n-1)! \, \delta_{n-1,n-1} = n!$.

b) $A^* \Omega_n = \sqrt{n+1} \cdot \Omega_{n+1}$ is the definition of $\Omega_{n+1}$.

$$ A \Omega_n = \frac{1}{\sqrt{n!}} A (A^*)^n \Omega_0 = \frac{1}{\sqrt{n!}} n (A^*)^{n-1} \Omega_0 = \sqrt{n} \, \Omega_{n-1} . $$

c) This follows from b) and the definition $\Omega_n = \frac{1}{\sqrt{n}} A^* \Omega_{n-1}$.

d) Part a) shows that $\{ \Omega_n \}_{n=0}^{\infty}$ is an orthonormal set in $L^2(\mathbb{R})$. In order
to show that they span $L^2(\mathbb{R})$, we have to verify that they span the dense
set

$$ S = \{ f \in C^\infty(\mathbb{R}) \mid x^m f^{(n)}(x) \to 0, \; |x| \to \infty, \; \forall m, n \in \mathbb{N} \} $$

called the Schwartz space. The reason is that by the Hahn-Banach theorem,
a function $f$ must be zero in $L^2(\mathbb{R})$ if it is orthogonal to a dense set. So,
let us assume $(f, \Omega_n) = 0$ for all $n$. Because $A^* + A = \sqrt{2} x$

$$ 0 = \sqrt{n!} \, (f, \Omega_n) = ( f, (A^*)^n \Omega_0 ) = ( f, (A^* + A)^n \Omega_0 ) = 2^{n/2} ( f, x^n \Omega_0 ) $$

we have

$$ \widehat{(f \Omega_0)}(k) = \int_{-\infty}^{\infty} f(x) \Omega_0(x) e^{ikx} \, dx $$

$$ = ( f, \Omega_0 e^{ikx} ) = ( f, \sum_{n \ge 0} \frac{(ikx)^n}{n!} \Omega_0 ) $$

$$ = \sum_{n \ge 0} \frac{(ik)^n}{n!} ( f, x^n \Omega_0 ) = 0 $$

and so $f \Omega_0 = 0$. Since $\Omega_0(x)$ is positive for all $x$, we must have $f = 0$. This
finishes the proof that we have a complete basis. □

Remark. This gives a complete solution to the quantum mechanical
harmonic oscillator. With the eigenvalues $\{ \lambda_n = n - 1/2 \}_{n=0}^{\infty}$ and the complete
set of eigenvectors $\Omega_n$ one can solve the Schrödinger equation

$$ i \hbar \frac{d}{dt} u = H u $$

by writing the function $u(x) = \sum_{n=0}^{\infty} u_n \Omega_n(x)$ as a sum of eigenfunctions,
where $u_n = (u, \Omega_n)$. The solution of the Schrödinger equation is

$$ u(t, x) = \sum_{n=0}^{\infty} u_n e^{-i (n - 1/2) t / \hbar} \Omega_n(x) . $$
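
Part a) of Theorem 4.13.1 can be verified numerically from the explicit formula $\Omega_n(x) = H_n(x) \Omega_0(x) / \sqrt{2^n n!}$. The following is a minimal sketch, assuming the standard three term recurrence $H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)$ for the Hermite polynomials (it reproduces the polynomials $H_0, \dots, H_3$ listed above); the quadrature grid and function names are illustrative.

```python
import math


def hermite_polynomials(x: float, count: int) -> list:
    """H_0(x), ..., H_{count-1}(x) via H_{n+1} = 2x H_n - 2n H_{n-1}."""
    values = [1.0, 2.0 * x]
    for n in range(1, count - 1):
        values.append(2.0 * x * values[n] - 2.0 * n * values[n - 1])
    return values[:count]


def hermite_functions(x: float, count: int) -> list:
    """Omega_n(x) = H_n(x) Omega_0(x) / sqrt(2^n n!),
    with Omega_0(x) = pi^(-1/4) exp(-x^2 / 2)."""
    ground = math.pi ** -0.25 * math.exp(-x * x / 2.0)
    return [h * ground / math.sqrt(2.0 ** n * math.factorial(n))
            for n, h in enumerate(hermite_polynomials(x, count))]


def gram_matrix(count: int, half_width: float = 12.0, steps: int = 4801) -> list:
    """Riemann sum approximation of the inner products (Omega_n, Omega_m)."""
    h = 2.0 * half_width / (steps - 1)
    gram = [[0.0] * count for _ in range(count)]
    for i in range(steps):
        values = hermite_functions(-half_width + i * h, count)
        for n in range(count):
            for m in range(count):
                gram[n][m] += values[n] * values[m] * h
    return gram


gram = gram_matrix(5)
for row in gram:
    print([round(entry, 8) for entry in row])
```

The printed matrix is the identity up to quadrature error, in agreement with $(\Omega_n, \Omega_m) = \delta_{n,m}$.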



Remark. The formalism of particle creation and annihilation operators
can be extended to some potentials of the form $U(x) = q^2(x) - q'(x)$. The
operator $H = -D^2/2 + U/2$ can then be written as $H = A^* A$, where

$$ A^* = \frac{1}{\sqrt{2}} ( q(x) - \frac{d}{dx} ), \qquad A = \frac{1}{\sqrt{2}} ( q(x) + \frac{d}{dx} ) . $$

The oscillator is the special case $q(x) = x$. See [12]. The Bäcklund
transformation $H = A^* A \mapsto \tilde{H} = A A^*$ is in the case of the harmonic oscillator the
map $H \mapsto H + 1$. It has the effect that it replaces $U$ with
$\tilde{U} = U - \partial_x^2 \log \Omega_0$, where $\Omega_0$ is the eigenfunction of the lowest eigenvalue. The
new operator $\tilde{H}$ has the same spectrum as $H$ except that the lowest eigenvalue
is removed. This procedure can be reversed to create "soliton potentials" out
of the vacuum. It is also natural to use the language of super-symmetry as
introduced by Witten: take two copies $\mathcal{H}_f \oplus \mathcal{H}_b$ of the Hilbert space where "$f$"
stands for Fermion and "$b$" for Boson. With

$$ Q = \begin{pmatrix} 0 & A^* \\ A & 0 \end{pmatrix}, \qquad P = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} $$

one can write $H \oplus \tilde{H} = Q^2$, $P^2 = 1$, $QP + PQ = 0$ and one says $(H, P, Q)$
has super-symmetry. The operator $Q$ is also called a Dirac operator. A
super-symmetric system has the property that nonzero eigenvalues have
the same number of bosonic and fermionic eigenstates. This implies that $\tilde{H}$
has the same spectrum as $H$ except that the lowest eigenvalue can disappear.



Remark. In quantum field theory, there exists a process called canonical 
quantization, where a quantum mechanical system is extended to a quan- 
tum field. Particle annihilation and creation operators play an important 
role. 



4.14 Feynman-Kac for the oscillator 

We want to treat perturbations $L = L_0 + V$ of the harmonic oscillator $L_0$ with a similar Feynman-Kac formula. The calculation of the integral kernel $p_t(x, y)$ of $e^{-tL_0}$ satisfying

$$(e^{-tL_0} f)(x) = \int_{\mathbb{R}} p_t(x,y) f(y)\, dy$$

is slightly more involved than in the case of the free Laplacian. Let $\Omega_0$ be the ground state of $L_0$ as in the last section.



Lemma 4.14.1. Given $f_0, f_1, \dots, f_n \in L^\infty(\mathbb{R})$ and $-\infty < s_0 < s_1 < \cdots < s_n < \infty$. Then

$$(\Omega_0, f_0 e^{-t_1 L_0} f_1 \cdots e^{-t_n L_0} f_n \Omega_0) = \int f_0(Q_{s_0}) \cdots f_n(Q_{s_n})\, dQ ,$$

where $t_0 = s_0$, $t_i = s_i - s_{i-1}$, $i \ge 1$.

Proof. The Trotter product formula for $L_0 = H_0 + U$ gives

$$\begin{aligned}
(\Omega_0, f_0 e^{-t_1 L_0} f_1 \cdots e^{-t_n L_0} f_n \Omega_0)
&= \lim_{m = (m_1, \dots, m_n),\ m_i \to \infty} (\Omega_0, f_0 (e^{-t_1 H_0/m_1} e^{-t_1 U/m_1})^{m_1} f_1 \cdots (e^{-t_n H_0/m_n} e^{-t_n U/m_n})^{m_n} f_n \Omega_0) \\
&= \lim_{m} \int f_0(x_0) \cdots f_n(x_n)\, dG_m(x, y)
\end{aligned}$$

and $G_m$ is a measure. Since $e^{-tH_0}$ has a Gaussian kernel and $e^{-tU}$ is a multiple of a Gaussian density and integrals are Gaussian, the measure $dG_m$ is Gaussian converging to a Gaussian measure $dG$. Since $L_0 (x \Omega_0) = x \Omega_0$ and $(x \Omega_0, x \Omega_0) = 1/2$ we have

$$\int x_i x_j\, dG = (x \Omega_0, e^{-(s_j - s_i) L_0} x \Omega_0) = \frac{1}{2} e^{-(s_j - s_i)}$$

which shows that $dG$ is the joint probability distribution of $Q_{s_0}, \dots, Q_{s_n}$. The claim follows. □



Theorem 4.14.2 (Mehler formula). The kernel $p_t(x, y)$ of $e^{-tL_0}$ is given by the Mehler formula

$$p_t(x,y) = \frac{1}{\sqrt{\pi \sigma^2}} \exp\Big( - \frac{(x^2 + y^2)(1 + e^{-2t}) - 4xy e^{-t}}{2\sigma^2} \Big)$$

with $\sigma^2 = (1 - e^{-2t})$.
Proof. We have

$$(f, e^{-tL_0} g) = \int f(y) \Omega_0^{-1}(y)\, g(x) \Omega_0^{-1}(x)\, dG(x,y) = \int \int f(y)\, p_t(x,y)\, g(x)\, dy\, dx$$

with the Gaussian measure $dG$ having covariance

$$A = \frac{1}{2} \begin{pmatrix} 1 & e^{-t} \\ e^{-t} & 1 \end{pmatrix} .$$

We get Mehler's formula by inverting this matrix and using that the density is

$$(2\pi)^{-1} \det(A)^{-1/2} e^{-\frac{1}{2} ((x,y), A^{-1} (x,y))} .$$

□ 
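The Mehler kernel can be tested numerically against the two lowest eigenfunctions. The sketch below assumes $\Omega_0(x) = \pi^{-1/4} e^{-x^2/2}$ and the eigenvalue relations $L_0 \Omega_0 = 0$, $L_0 (x\Omega_0) = x \Omega_0$ used above, and approximates $\int p_t(x,y) g(y)\, dy$ by a midpoint rule on a truncated interval; the function names and the truncation bounds are choices made for this illustration.

```python
import math

def mehler_kernel(x, y, t):
    """p_t(x, y) with sigma^2 = 1 - exp(-2t)."""
    s2 = 1.0 - math.exp(-2.0 * t)
    q = (x * x + y * y) * (1.0 + math.exp(-2.0 * t)) - 4.0 * x * y * math.exp(-t)
    return math.exp(-q / (2.0 * s2)) / math.sqrt(math.pi * s2)

def ground_state(x):
    """Omega_0(x) = pi^(-1/4) exp(-x^2 / 2), assumed normalization."""
    return math.pi ** -0.25 * math.exp(-0.5 * x * x)

def first_excited(x):
    """x * Omega_0(x), an (unnormalized) eigenfunction with eigenvalue 1."""
    return x * ground_state(x)

def apply_semigroup(g, x, t, lo=-12.0, hi=12.0, steps=4000):
    """Midpoint-rule approximation of (e^{-t L_0} g)(x) = int p_t(x, y) g(y) dy."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        total += mehler_kernel(x, y, t) * g(y)
    return h * total

if __name__ == "__main__":
    x, t = 0.7, 0.5
    print(apply_semigroup(ground_state, x, t), ground_state(x))
    print(apply_semigroup(first_excited, x, t), math.exp(-t) * first_excited(x))
```

The two printed pairs should agree closely: the ground state is left fixed by $e^{-tL_0}$, while $x\Omega_0$ is damped by the factor $e^{-t}$.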

Definition. Let $dQ$ be the Wiener measure on $C(\mathbb{R})$ belonging to the oscillator process $Q_t$.



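Theorem 4.14.3 (Feynman-Kac for oscillator process). Given $L = L_0 + V$ with $V \in C_0^\infty(\mathbb{R})$, then

$$(f \Omega_0, e^{-tL} g \Omega_0) = \int \overline{f}(Q_0)\, g(Q_t)\, e^{-\int_0^t V(Q_s)\, ds}\, dQ$$

for all $f, g \in L^2(\mathbb{R}, \Omega_0^2\, dx)$.

Proof. By the Trotter product formula

$$(f \Omega_0, e^{-tL} g \Omega_0) = \lim_{n \to \infty} (f \Omega_0, (e^{-tL_0/n} e^{-tV/n})^n g \Omega_0)$$

so that

$$(f \Omega_0, e^{-tL} g \Omega_0) = \lim_{n \to \infty} \int \overline{f}(Q_0)\, g(Q_t) \exp\Big( -\frac{t}{n} \sum_{j=0}^{n-1} V(Q_{tj/n}) \Big)\, dQ . \qquad (4.5)$$

Since $Q$ is continuous, we have almost everywhere

$$\frac{t}{n} \sum_{j=0}^{n-1} V(Q_{tj/n}) \to \int_0^t V(Q_s)\, ds .$$

The integrand on the right hand side of (4.5) is dominated by

$$|f(Q_0)|\, |g(Q_t)|\, e^{t \|V\|_\infty}$$

which is in $L^1(dQ)$ since

$$\int |f(Q_0)|\, |g(Q_t)|\, dQ = (\Omega_0 |f|, e^{-tL_0} \Omega_0 |g|) < \infty .$$

The dominated convergence theorem (2.4.3) gives the claim. □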
4.15 Neighborhood of Brownian motion

The Feynman-Kac formula can be used to understand the Dirichlet Laplacian of a domain $D \subset \mathbb{R}^d$. For more details, see [97].

Example. Let $D$ be an open set in $\mathbb{R}^d$ such that the Lebesgue measure $|D|$ is finite and the Lebesgue measure of the boundary $|\delta D|$ is zero. Denote by $H_D$ the Dirichlet Laplacian $-\Delta/2$. Denote by $k_D(E)$ the number of eigenvalues of $H_D$ below $E$. This function is also called the integrated density of states. Denote with $K_d$ the unit ball in $\mathbb{R}^d$ and with $|K_d| = \mathrm{Vol}(K_d) = \pi^{d/2}\, \Gamma(\frac{d}{2} + 1)^{-1}$ its volume. Weyl's formula describes the asymptotic behavior of $k_D(E)$ for large $E$:

$$\lim_{E \to \infty} \frac{k_D(E)}{E^{d/2}} = \frac{|K_d| \cdot |D|}{2^{d/2} \pi^d} .$$

It shows that one can read off the volume of $D$ from the spectrum of the Laplacian.
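Weyl's formula can be checked directly on a domain whose spectrum is explicit. The sketch below assumes the square $D = (0, \ell)^2$, for which the Dirichlet eigenvalues of $-\Delta/2$ are $\frac{\pi^2}{2\ell^2}(j^2 + k^2)$ with integers $j, k \ge 1$; the function names are illustrative.

```python
import math

def count_eigenvalues_square(E, side=1.0):
    """Count Dirichlet eigenvalues of -Delta/2 on (0, side)^2 that are at most E.

    The eigenvalues are (pi^2 / (2 side^2)) * (j^2 + k^2) with j, k >= 1,
    so we count lattice points with j^2 + k^2 <= r2.
    """
    r2 = 2.0 * E * side * side / math.pi ** 2
    count = 0
    j = 1
    while j * j < r2:
        count += int(math.floor(math.sqrt(r2 - j * j)))
        j += 1
    return count

def weyl_prediction(E, volume, d):
    """Leading term |K_d| |D| E^(d/2) / (2^(d/2) pi^d) of Weyl's formula."""
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return unit_ball * volume * E ** (d / 2) / (2 ** (d / 2) * math.pi ** d)

if __name__ == "__main__":
    for E in (1e3, 1e4, 1e5):
        n = count_eigenvalues_square(E)
        print(E, n, n / weyl_prediction(E, volume=1.0, d=2))
```

The printed ratio approaches 1 slowly from below, because the boundary of the square contributes a correction of lower order in $E$.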

Example. Put $n$ ice balls $K_{j,n}$, $1 \le j \le n$ of radius $r_n$ into a glass of water so that $n \cdot r_n = \alpha$. In order to know how well this ice cools the water, it is good to know the lowest eigenvalue $E_1$ of the Dirichlet Laplacian $H_D$, since the motion of the temperature distribution $u$ by the heat equation $\dot{u} = -H_D u$ is dominated by $e^{-tE_1}$. This motivates computing the lowest eigenvalue of the domain $D \setminus \bigcup_{j=1}^n K_{j,n}$. This can be done exactly in the limit $n \to \infty$ and when the ice $K_{j,n}$ is randomly distributed in the glass. Mathematically, this is described as follows:

Let $D$ be an open bounded domain in $\mathbb{R}^d$. Given a sequence $x = (x_1, x_2, \dots)$ which is an element in $D^{\mathbb{N}}$ and a sequence of radii $r_1, r_2, \dots$, define

$$D_n = D \setminus \bigcup_{i=1}^n \{ |x - x_i| \le r_n \} .$$

This is the domain $D$ with $n$ balls $K_{j,n}$ with centers $x_1, \dots, x_n$ and radius $r_n$ removed. Let $H(x, n)$ be the Dirichlet Laplacian on $D_n$ and $E_k(x, n)$ the $k$-th eigenvalue of $H(x, n)$, which are random variables $E_k(n)$ in $x$, if $D^{\mathbb{N}}$ is equipped with the product Lebesgue measure. One can show that in the case $n r_n \to \alpha$

$$E_k(n) \to E_k(0) + 2\pi\alpha |D|^{-1}$$

in probability. Random impurities produce a constant shift in the spectrum. For the physical system with the crushed ice, where the crushing makes $n r_n \to \infty$, there is much better cooling, as one might expect.

Definition. Let $W_\delta(t)$ be the set

$$\{ x \in \mathbb{R}^d \mid |x - B_s(\omega)| \le \delta, \text{ for some } s \in [0, t] \} .$$

It is of course dependent on $\omega$ and just a $\delta$-neighborhood of the Brownian path $B_{[0,t]}(\omega)$. This set is called Wiener sausage and one is interested in the expected volume $|W_\delta(t)|$ of this set as $\delta \to 0$. We will look at this problem a bit more closely in the rest of this section.
Figure. A sample of Wiener sausage in the plane $d = 2$. A finite path of Brownian motion with its neighborhood $W_\delta$.




Let's first prove a lemma, which relates the Dirichlet Laplacian $H_D = -\Delta/2$ on $D$ with Brownian motion.

Lemma 4.15.1. Let $D$ be a bounded domain in $\mathbb{R}^d$ containing $0$ and $p_D(x, y, t)$ the integral kernel of $e^{-tH}$, where $H$ is the Dirichlet Laplacian on $D$. Then

$$P[B_s \in D^c \text{ for some } 0 \le s \le t] = 1 - \int p_D(0, x, t)\, dx .$$

Proof. (i) It is known that the Dirichlet Laplacian can be approximated in the strong resolvent sense by operators $H_0 + \lambda V$, where $V = 1_{D^c}$ is the characteristic function of the exterior $D^c$ of $D$. This means that

$$(H_0 + \lambda \cdot V - z)^{-1} u \to (H_D - z)^{-1} u, \quad \lambda \to \infty$$

for $z$ outside $[0, \infty)$ and all $u \in C_c^\infty(\mathbb{R}^d)$.

(ii) Since Brownian paths are continuous, we have $\int_0^t V(B_s)\, ds > 0$ if and only if $B_s \in D^c$ for some $s \in [0, t]$. We get therefore

$$e^{-\lambda \int_0^t V(B_s)\, ds} \to 1_{\{B_s \in D;\ 0 \le s \le t\}}, \quad \lambda \to \infty$$

pointwise almost everywhere.

Let $u_n$ be a sequence in $C_c^\infty$ converging pointwise to $1$. We get with the dominated convergence theorem (2.4.3), using (i) and (ii) and Feynman-Kac

$$\begin{aligned}
P[B_s \in D;\ 0 \le s \le t] &= \lim_{n \to \infty} E[u_n(B_t)\, 1_{\{B_s \in D;\ 0 \le s \le t\}}] \\
&= \lim_{n \to \infty} \lim_{\lambda \to \infty} E[e^{-\lambda \int_0^t V(B_s)\, ds} u_n(B_t)] \\
&= \lim_{n \to \infty} \lim_{\lambda \to \infty} e^{-t(H_0 + \lambda \cdot V)} u_n(0) \\
&= \lim_{n \to \infty} e^{-tH_D} u_n(0) \\
&= \lim_{n \to \infty} \int p_D(0, x, t)\, u_n(x)\, dx = \int p_D(0, x, t)\, dx .
\end{aligned}$$

Taking the complementary event gives the claim. □



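Theorem 4.15.2 (Spitzer). In three dimensions $d = 3$,

$$E[|W_\delta(t)|] = 2\pi\delta t + 4\delta^2 \sqrt{2\pi t} + \frac{4\pi}{3} \delta^3 .$$

Proof. Using Brownian scaling,

$$\begin{aligned}
E[|W_{\lambda\delta}(\lambda^2 t)|] &= E[|\{ |x - B_s| \le \lambda\delta,\ 0 \le s \le \lambda^2 t \}|] \\
&= E[|\{ |\tfrac{x}{\lambda} - \tfrac{B_s}{\lambda}| \le \delta,\ 0 \le \tilde{s} = s/\lambda^2 \le t \}|] \\
&= E[|\{ |\tfrac{x}{\lambda} - B_{\tilde{s}}| \le \delta,\ 0 \le \tilde{s} \le t \}|] \\
&= \lambda^3 \cdot E[|W_\delta(t)|] ,
\end{aligned}$$

so that one can assume without loss of generality that $\delta = 1$: knowing $E[|W_1(t)|]$ we get the general case with the formula $E[|W_\delta(t)|] = \delta^3 \cdot E[|W_1(\delta^{-2} t)|]$.

Let $K$ be the closed unit ball in $\mathbb{R}^d$. Define the hitting probability

$$f(x, t) = P[x + B_s \in K \text{ for some } 0 \le s \le t] .$$

We have

$$E[|W_1(t)|] = \int_{\mathbb{R}^d} f(x, t)\, dx .$$

Proof. By Fubini,

$$\begin{aligned}
E[|W_1(t)|] &= \int \int 1_{\{x \in W_1(t)\}}\, dx\, dB \\
&= \int \int 1_{\{B_s - x \in K \text{ for some } 0 \le s \le t\}}\, dx\, dB \\
&= \int P[B_s - x \in K \text{ for some } 0 \le s \le t]\, dx \\
&= \int f(x, t)\, dx ,
\end{aligned}$$

where the last step uses that $B$ and $-B$ have the same distribution and that $K$ is symmetric.

The hitting probability is radially symmetric and can be computed explicitly in terms of $r = |x|$: for $|x| \ge 1$, one has

$$f(x, t) = \frac{2}{r \sqrt{2\pi t}} \int_0^\infty e^{-\frac{(|x| + z - 1)^2}{2t}}\, dz .$$

Proof. The kernel of $e^{-tH}$ satisfies the heat equation

$$\partial_t p(x, 0, t) = (\Delta/2)\, p(x, 0, t)$$

inside $D$. From the previous lemma follows that $\dot{f} = (\Delta/2) f$, so that the function $g(r, t) = r f(x, t)$ satisfies $\dot{g} = \frac{1}{2} \frac{\partial^2}{\partial r^2} g(r, t)$ with boundary condition $g(r, 0) = 0$, $g(1, t) = 1$. We compute

$$\int_{|x| \ge 1} f(x, t)\, dx = 2\pi t + 4 \sqrt{2\pi t}$$

and $\int_{|x| \le 1} f(x, t)\, dx = 4\pi/3$ so that

$$E[|W_1(t)|] = 2\pi t + 4 \sqrt{2\pi t} + 4\pi/3 .$$

□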
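The last integral in the proof can be confirmed by numerical quadrature. The sketch below uses the hitting probability in the equivalent form $f(x,t) = \mathrm{erfc}((r-1)/\sqrt{2t})/r$ for $r = |x| \ge 1$, which follows from the integral formula by the substitution $u = (r + z - 1)/\sqrt{2t}$; the truncation radius and step count are choices made for this illustration.

```python
import math

def hitting_probability(r, t):
    """Probability that 3D Brownian motion started at distance r hits the unit ball by time t."""
    if r <= 1.0:
        return 1.0
    return math.erfc((r - 1.0) / math.sqrt(2.0 * t)) / r

def expected_sausage_volume(t, steps=20000):
    """Approximate E|W_1(t)| = int f(x, t) dx in d = 3 by a radial midpoint rule."""
    r_max = 1.0 + 12.0 * math.sqrt(t)  # beyond this radius the integrand is negligible
    h = (r_max - 1.0) / steps
    outer = 0.0
    for k in range(steps):
        r = 1.0 + (k + 0.5) * h
        outer += 4.0 * math.pi * r * r * hitting_probability(r, t) * h
    return outer + 4.0 * math.pi / 3.0  # the unit ball itself contributes its volume

def spitzer(t, delta=1.0):
    """Closed form 2 pi delta t + 4 delta^2 sqrt(2 pi t) + (4 pi / 3) delta^3."""
    return (2.0 * math.pi * delta * t
            + 4.0 * delta ** 2 * math.sqrt(2.0 * math.pi * t)
            + 4.0 * math.pi * delta ** 3 / 3.0)

if __name__ == "__main__":
    for t in (0.5, 2.0, 10.0):
        print(t, expected_sausage_volume(t), spitzer(t))
```

The scaling relation $E[|W_\delta(t)|] = \delta^3 E[|W_1(\delta^{-2} t)|]$ can be checked in the same way by comparing `spitzer(t, delta)` with `delta**3 * spitzer(t / delta**2)`.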



Corollary 4.15.3. In three dimensions, one has:

$$\lim_{\delta \to 0} \frac{1}{\delta} \cdot E[|W_\delta(t)|] = 2\pi t$$

and

$$\lim_{t \to \infty} \frac{1}{t} \cdot E[|W_\delta(t)|] = 2\pi\delta .$$

Proof. The proof follows immediately from Spitzer's theorem (4.15.2). □

Remark. If Brownian motion were one-dimensional, then $\delta^{-2} E[|W_\delta(t)|]$ would stay bounded as $\delta \to 0$. The corollary shows that the Wiener sausage is quite "fat". Brownian motion is rather "two-dimensional".

Remark. Kesten, Spitzer and Wightman have got stronger results. It is even true that $\lim_{\delta \to 0} |W_\delta(t)|/\delta = 2\pi t$ and $\lim_{t \to \infty} |W_\delta(t)|/t = 2\pi\delta$ for almost all paths.
4.16 The Ito integral for Brownian motion

We start now to develop stochastic integration, first for Brownian motion and then more generally for continuous martingales. Let's start with a motivation. We know by theorem (4.2.5) that almost all paths of Brownian motion are not differentiable. The usual Lebesgue-Stieltjes integral

$$\int_0^t f(B_s)\, \dot{B}_s\, ds$$

can therefore not be defined. We are first going to see how a stochastic integral can still be constructed. Actually, we were already dealing with a special case of stochastic integrals, namely with Wiener integrals $\int f(B_s)\, dB_s$, where $f$ is a function on $C([0, \infty], \mathbb{R}^d)$ which can contain for example $\int_0^t V(B_s)\, ds$ as in the Feynman-Kac formula. But the result of this integral was a number, while the stochastic integral we are going to define will be a random variable.

Definition. Let $B_t$ be the one-dimensional Brownian motion process and let $f$ be a function $f : \mathbb{R} \to \mathbb{R}$. Define for $n \in \mathbb{N}$ the random variable

$$J_n(f) = \sum_{m=1}^{2^n} f(B_{(m-1) 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) =: \sum_{m=1}^{2^n} J_{n,m}(f) .$$

We will use later for $J_{n,m}(f)$ also the notation $f(B_{t_{m-1}})\, \delta_n B_{t_m}$, where $\delta_n B_t = B_t - B_{t - 2^{-n}}$.

Remark. We have earlier defined the discrete stochastic integral for a previsible process $C$ and a martingale $X$

$$\Big( \int C\, dX \Big)_n = \sum_{m=1}^n C_m (X_m - X_{m-1}) .$$

If we want to take for $C$ a function of $X$, then we have to take $C_m = f(X_{m-1})$. This is the reason why we have to take the differentials $\delta_n B_{t_m}$ to "stick out into the future".

The stochastic integral is a limit of discrete stochastic integrals: 



Lemma 4.16.1. If $f \in C^1(\mathbb{R})$ such that $f, f'$ are bounded on $\mathbb{R}$, then $J_n(f)$ converges in $L^2$ to a random variable

$$\int_0^1 f(B_s)\, dB = \lim_{n \to \infty} J_n$$

satisfying

$$\Big\| \int_0^1 f(B_s)\, dB \Big\|_2^2 = E\Big[ \int_0^1 f(B_s)^2\, ds \Big] .$$
Proof. (i) For $i \ne j$ we have $E[J_{n,i}(f) J_{n,j}(f)] = 0$.

Proof. For $j > i$, there is a factor $B_{j 2^{-n}} - B_{(j-1) 2^{-n}}$ of $J_{n,i}(f) J_{n,j}(f)$ independent of the rest of $J_{n,i}(f) J_{n,j}(f)$ and the claim follows from $E[B_{j 2^{-n}} - B_{(j-1) 2^{-n}}] = 0$.

(ii) $E[J_{n,m}(f)^2] = E[f(B_{(m-1) 2^{-n}})^2]\, 2^{-n}$.

Proof. $f(B_{(m-1) 2^{-n}})$ is independent of $(B_{m 2^{-n}} - B_{(m-1) 2^{-n}})^2$ which has expectation $2^{-n}$.

(iii) From (i) and (ii) follows

$$\| J_n(f) \|_2^2 = \sum_{m=1}^{2^n} E[f(B_{(m-1) 2^{-n}})^2]\, 2^{-n} .$$

(iv) The claim: $J_n$ converges in $L^2$.

Since $f \in C^1$ has a bounded derivative, there exists $C = \|f'\|_\infty^2$ and this gives $|f(x) - f(y)|^2 \le C \cdot |x - y|^2$. We get

$$\begin{aligned}
\| J_{n+1}(f) - J_n(f) \|_2^2
&= \sum_{m=0}^{2^n - 1} E[(f(B_{(2m+1) 2^{-(n+1)}}) - f(B_{(2m) 2^{-(n+1)}}))^2]\, 2^{-(n+1)} \\
&\le C \sum_{m=0}^{2^n - 1} E[(B_{(2m+1) 2^{-(n+1)}} - B_{(2m) 2^{-(n+1)}})^2]\, 2^{-(n+1)} \\
&= C \cdot 2^{-n-2} ,
\end{aligned}$$

where the last equality followed from the fact that $E[(B_{(2m+1) 2^{-(n+1)}} - B_{(2m) 2^{-(n+1)}})^2] = 2^{-(n+1)}$ since $B$ is Gaussian. We see that $J_n$ is a Cauchy sequence in $L^2$ and has therefore a limit.

(v) The claim $\| \int_0^1 f(B_s)\, dB \|_2^2 = E[\int_0^1 f(B_s)^2\, ds]$.

Proof. Since $\sum_m f(B_{(m-1) 2^{-n}})^2\, 2^{-n}$ converges pointwise to $\int_0^1 f(B_s)^2\, ds$ (which exists because $f$ and $B_s$ are continuous), and is dominated by $\|f\|_\infty^2$, the claim follows since $J_n$ converges in $L^2$. □
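The isometry in (v) already holds for the discrete sums $J_n(f)$ by (i) and (ii), which makes it easy to test by simulation. The following is a minimal Monte Carlo sketch, assuming $f = \cos$ as a bounded $C^1$ test function and simulating Brownian increments with the standard library generator; the sample sizes and the seed are arbitrary choices for this illustration.

```python
import math
import random

def ito_sum_and_energy(f, n_steps, rng):
    """Simulate one path on [0, 1] and return (J, S) with
    J = sum f(B_{t_{m-1}}) (B_{t_m} - B_{t_{m-1}})   (left-endpoint sum, as in J_n(f)),
    S = sum f(B_{t_{m-1}})^2 dt                       (Riemann sum of f(B_s)^2)."""
    dt = 1.0 / n_steps
    sd = math.sqrt(dt)
    b = 0.0
    j = 0.0
    s = 0.0
    for _ in range(n_steps):
        db = rng.gauss(0.0, sd)
        j += f(b) * db        # the increment "sticks out into the future"
        s += f(b) ** 2 * dt
        b += db
    return j, s

def isometry_estimates(f, n_paths=4000, n_steps=128, seed=1):
    """Monte Carlo estimates of E[J^2] and E[S]; the two should agree."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_paths):
        j, s = ito_sum_and_energy(f, n_steps, rng)
        lhs += j * j
        rhs += s
    return lhs / n_paths, rhs / n_paths

if __name__ == "__main__":
    print(isometry_estimates(math.cos))
```

For $f = \cos$ the right hand side can also be computed by hand: $E[\cos^2(B_s)] = (1 + e^{-2s})/2$, so $E[\int_0^1 f(B_s)^2\, ds] = 1/2 + (1 - e^{-2})/4 \approx 0.716$.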

We can extend the integral to functions $f$ which are locally $L^1$ and bounded near $0$. We write $L^p_{loc}(\mathbb{R})$ for functions $f$ which are in $L^p(I)$ when restricted to any finite interval $I$ on the real line.

Corollary 4.16.2. $\int_0^1 f(B_s)\, dB$ exists as a $L^2$ random variable for $f \in L^1_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$ and any $\epsilon > 0$.
Proof. (i) If $f \in L^1_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$ for some $\epsilon > 0$, then

$$E\Big[ \int_0^1 f(B_s)^2\, ds \Big] = \int_0^1 \int_{\mathbb{R}} \frac{f(x)^2}{\sqrt{2\pi s}} e^{-x^2/2s}\, dx\, ds < \infty .$$

(ii) If $f \in L^1_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$, then for almost every $B(\omega)$, the limit

$$\lim_{a \to \infty} \int_0^1 1_{[-a,a]}(B_s) f(B_s)^2\, ds$$

exists pointwise and is finite.

Proof. $B_s$ is continuous for almost all $\omega$ so that $1_{[-a,a]}(B_s) f(B_s)$ is independent of $a$ for large $a$. The integral $E[\int_0^1 1_{[-a,a]}(B_s) f(B_s)^2\, ds]$ is bounded by $E[\int_0^1 f(B_s)^2\, ds] < \infty$ by (i).

(iii) The claim.

Proof. Assume $f \in L^1_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$. Given $f_n \in C^1(\mathbb{R})$ with $1_{[-a,a]} f_n \to 1_{[-a,a]} f$ in $L^2(\mathbb{R})$.

By the dominated convergence theorem (2.4.3), we have

$$\int 1_{[-a,a]} f_n(B_s)\, dB \to \int 1_{[-a,a]} f(B_s)\, dB$$

in $L^2$. Since by (ii), the $L^2$ bound is independent of $a$, we can also pass to the limit $a \to \infty$. □

Definition. This integral is called an Ito integral. Having the one-dimensional integral also allows us to set up the integral in higher dimensions: with Brownian motion in $\mathbb{R}^d$ and $f \in L^1_{loc}(\mathbb{R}^d)$ define the integral $\int_0^1 f(B_s)\, dB_s$ componentwise.



Lemma 4.16.3. For $n \to \infty$,

$$\sum_{j=1}^{2^n} J_{n,j}(1)^2 = \sum_{j=1}^{2^n} (B_{j/2^n} - B_{(j-1)/2^n})^2 \to 1 .$$

Proof. By definition of Brownian motion, we know that for fixed $n$, the $J_{n,j}$ are $N(0, 2^{-n})$-distributed random variables and so

$$E\Big[ \sum_{j=1}^{2^n} J_{n,j}(1)^2 \Big] = 2^n \cdot \mathrm{Var}[B_{j/2^n} - B_{(j-1)/2^n}] = 2^n 2^{-n} = 1 .$$

Now, $X_j = 2^{n/2} J_{n,j}$ are IID $N(0, 1)$-distributed random variables so that by the law of large numbers

$$\frac{1}{2^n} \sum_{j=1}^{2^n} X_j^2 \to 1$$

for $n \to \infty$. □
The formal rules of integration do not hold for this integral. We have for example in one dimension:

$$\int_0^1 B_s\, dB = \frac{1}{2} (B_1^2 - 1) \ne \frac{1}{2} (B_1^2 - B_0^2) .$$

Proof. Define with $f(x) = x$

$$J_n^- = \sum_{m=1}^{2^n} f(B_{(m-1) 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) ,$$

$$J_n^+ = \sum_{m=1}^{2^n} f(B_{m 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) .$$

The above lemma implies that $J_n^+ - J_n^- \to 1$ almost everywhere for $n \to \infty$ and we check also $J_n^+ + J_n^- = B_1^2$. Both of these identities come from cancellations in the sum and imply together the claim. □
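The two identities in this proof can be observed on a simulated path. The sketch below builds a discrete Brownian path on $[0,1]$ from Gaussian increments and forms the left-endpoint and right-endpoint sums for $f(x) = x$; the step count and seed are arbitrary choices, and the names are illustrative.

```python
import math
import random

def brownian_path(n_steps, rng):
    """Return [B_0, B_{1/n}, ..., B_1] built from independent N(0, 1/n) increments."""
    sd = math.sqrt(1.0 / n_steps)
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, sd))
    return path

def left_and_right_sums(path):
    """Left-endpoint and right-endpoint sums for the integrand f(x) = x."""
    j_minus = sum(path[m - 1] * (path[m] - path[m - 1]) for m in range(1, len(path)))
    j_plus = sum(path[m] * (path[m] - path[m - 1]) for m in range(1, len(path)))
    return j_minus, j_plus

rng = random.Random(0)
path = brownian_path(2 ** 14, rng)
j_minus, j_plus = left_and_right_sums(path)

if __name__ == "__main__":
    print("J+ + J- =", j_plus + j_minus, " B_1^2 =", path[-1] ** 2)
    print("J+ - J- =", j_plus - j_minus, " (sum of squared increments)")
    print("J-      =", j_minus, " (B_1^2 - 1)/2 =", 0.5 * (path[-1] ** 2 - 1.0))
```

The sum $J^+ + J^-$ telescopes exactly to $B_1^2$ for every path, while $J^+ - J^-$ is the sum of squared increments and is only close to $1$; the left-endpoint sum is the one that approximates the Ito integral.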

We mention now some trivial properties of the stochastic integral. 



Theorem 4.16.4 (Properties of the Ito integral). Here are some basic properties of the Ito integral:

(1) $\int_0^t f(B_s) + g(B_s)\, dB_s = \int_0^t f(B_s)\, dB_s + \int_0^t g(B_s)\, dB_s$.

(2) $\int_0^t \lambda \cdot f(B_s)\, dB_s = \lambda \cdot \int_0^t f(B_s)\, dB_s$.

(3) $t \mapsto \int_0^t f(B_s)\, dB_s$ is a continuous map from $\mathbb{R}^+$ to $L^2$.

(4) $E[\int_0^t f(B_s)\, dB_s] = 0$.

(5) $\int_0^t f(B_s)\, dB_s$ is $\mathcal{A}_t$ measurable.

Proof. (1) and (2) follow from the definition of the integral.
For (3) define $X_t = \int_0^t f(B_s)\, dB$. Since

$$\| X_t - X_{t+\epsilon} \|_2^2 = E\Big[ \int_t^{t+\epsilon} f(B_s)^2\, ds \Big] = \int_t^{t+\epsilon} \int_{\mathbb{R}} \frac{f(x)^2}{\sqrt{2\pi s}} e^{-x^2/2s}\, dx\, ds \to 0$$

for $\epsilon \to 0$, the claim follows.

(4) and (5) can be seen by verifying it first for elementary functions $f$. □

It will be useful to consider another generalization of the integral.

Definition. If $dW = dx\, dB$ is the Wiener measure on $\mathbb{R}^d \times C([0, \infty))$, define

$$\int_0^t f(W_s)\, dW_s = \int_{\mathbb{R}^d} \int_0^t f(x + B_s)\, dB_s\, dx .$$
Definition. Assume $f$ is also time dependent so that it is a function on $\mathbb{R}^d \times \mathbb{R}$. As long as $E[\int_0^t |f(B_s, s)|^2\, ds] < \infty$, we can also define the integral

$$\int_0^t f(B_s, s)\, dB_s .$$

The following formula is useful for understanding and calculating stochas- 
tic integrals. It is the "fundamental theorem for stochastic integrals" and 
allows to do "change of variables" in stochastic calculus similarly as the 
fundamental theorem of calculus does for usual calculus. 



Theorem 4.16.5 (Ito's formula). For a $C^2$ function $f(x)$ on $\mathbb{R}^d$

$$f(B_t) - f(B_0) = \int_0^t \nabla f(B_s) \cdot dB_s + \frac{1}{2} \int_0^t \Delta f(B_s)\, ds .$$

If $B_s$ were an ordinary path in $\mathbb{R}^d$ with velocity vector $dB_s = \dot{B}_s\, ds$, then we would have

$$f(B_t) - f(B_0) = \int_0^t \nabla f(B_s) \cdot \dot{B}_s\, ds$$

by the fundamental theorem of line integrals in calculus. It is a bit surprising that in the stochastic setup, a second derivative $\Delta f$ appears in a first order differential. One writes sometimes the formula also in the differential form

$$df = \nabla f\, dB + \frac{1}{2} \Delta f\, dt .$$
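Ito's formula can be tested on a single simulated path. The sketch below assumes the one-dimensional test function $f(x) = x^3$, for which the formula reads $B_1^3 = \int_0^1 3 B_s^2\, dB_s + \int_0^1 3 B_s\, ds$, and replaces both integrals by left-endpoint sums on a fine grid; the function name, step count and seed are choices made for this illustration.

```python
import math
import random

def ito_formula_residual(n_steps=2 ** 16, seed=2):
    """Return f(B_1) - f(B_0) - (stochastic sum + drift sum) for f(x) = x^3.

    stochastic sum ~ int_0^1 f'(B_s) dB_s     with f'(x) = 3 x^2
    drift sum      ~ (1/2) int_0^1 f''(B_s) ds with f''(x) = 6 x
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    sd = math.sqrt(dt)
    b = 0.0
    stochastic = 0.0
    drift = 0.0
    for _ in range(n_steps):
        db = rng.gauss(0.0, sd)
        stochastic += 3.0 * b * b * db   # left endpoint, as in the Ito integral
        drift += 0.5 * 6.0 * b * dt      # the second-derivative correction
        b += db
    return b ** 3 - (stochastic + drift)

if __name__ == "__main__":
    print(ito_formula_residual())
```

The residual is small but not zero for a finite grid. Dropping the drift term leaves a discrepancy that does not vanish as the grid is refined, which is the effect of the second derivative in the formula.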

Remark. We cite [11]: "Ito's formula is now the bread and butter of the 
"quant" department of several major financial institutions. Models like that 
of Black-Scholes constitute the basis on which a modern business makes de- 
cisions about how everything from stocks and bonds to pork belly futures 
should be priced. Ito's formula provides the link between various stochastic 
quantities and differential equations of which those quantities are the so- 
lution." For more information on the Black-Scholes model and the famous 
Black-Scholes formula, see [16]. 

It is not much more work to prove a more general formula for functions 
f(x,t), which can be time-dependent too: 



Theorem 4.16.6 (Generalized Ito formula). Given a function $f(x, t)$ on $\mathbb{R}^d \times [0, t]$ which is twice differentiable in $x$ and differentiable in $t$. Then

$$f(B_t, t) - f(B_0, 0) = \int_0^t \nabla f(B_s, s) \cdot dB_s + \frac{1}{2} \int_0^t \Delta f(B_s, s)\, ds + \int_0^t \dot{f}(B_s, s)\, ds .$$

In differential notation, this means

$$df = \nabla f\, dB + \Big( \frac{1}{2} \Delta f + \dot{f} \Big)\, dt .$$

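Proof. By a change of variables, we can assume $t = 1$. For each $n$, we discretize time

$$\{ 0 < 2^{-n} < \dots,\ t_k = k \cdot 2^{-n},\ \dots,\ 1 \}$$

and define $\delta_n B_{t_k} = B_{t_k} - B_{t_{k-1}}$. We write

$$\begin{aligned}
f(B_1, 1) - f(B_0, 0) &= \sum_{k=1}^{2^n} (\nabla f)(B_{t_{k-1}}, t_{k-1})\, \delta_n B_{t_k} \\
&+ \sum_{k=1}^{2^n} f(B_{t_k}, t_{k-1}) - f(B_{t_{k-1}}, t_{k-1}) - (\nabla f)(B_{t_{k-1}}, t_{k-1})\, \delta_n B_{t_k} \\
&+ \sum_{k=1}^{2^n} f(B_{t_k}, t_k) - f(B_{t_k}, t_{k-1}) \\
&= I_n + II_n + III_n .
\end{aligned}$$

(i) By definition of the Ito integral, the first sum $I_n$ converges in $L^2$ to $\int_0^1 (\nabla f)(B_s, s)\, dB_s$.

(ii) If $p > 2$, we have $\sum_{k=1}^{2^n} |\delta_n B_{t_k}|^p \to 0$ for $n \to \infty$.

Proof. $\delta_n B_{t_k}$ is a $N(0, 2^{-n})$-distributed random variable so that

$$E[|\delta_n B_{t_k}|^p] = (2\pi 2^{-n})^{-1/2} \int_{-\infty}^{\infty} |x|^p e^{-x^2/(2 \cdot 2^{-n})}\, dx = C\, 2^{-(np)/2} .$$

This means

$$E\Big[ \sum_{k=1}^{2^n} |\delta_n B_{t_k}|^p \Big] = C\, 2^n 2^{-(np)/2}$$

which goes to zero for $n \to \infty$ and $p > 2$.

(iii) $\sum_{k=1}^{2^n} E[(B_{t_k} - B_{t_{k-1}})^4] \to 0$ follows from (ii). We have therefore for bounded $g$

$$\begin{aligned}
\sum_{k=1}^{2^n} E[g(B_{t_k}, t_k)^2 ((B_{t_k} - B_{t_{k-1}})^2 - 2^{-n})^2] &\le C \sum_{k=1}^{2^n} \mathrm{Var}[(B_{t_k} - B_{t_{k-1}})^2] \\
&\le C \sum_{k=1}^{2^n} E[(B_{t_k} - B_{t_{k-1}})^4] \to 0 .
\end{aligned}$$

(iv) Using a Taylor expansion

$$f(x) = f(y) + \nabla f(y)(x - y) + \frac{1}{2} \sum_{i,j} \partial_{x_i x_j} f(y) (x - y)_i (x - y)_j + O(|x - y|^3) ,$$

we get for $n \to \infty$

$$II_n - \sum_{k=1}^{2^n} \frac{1}{2} \sum_{i,j} \partial_{x_i x_j} f(B_{t_{k-1}}, t_{k-1}) (\delta_n B_{t_k})_i (\delta_n B_{t_k})_j \to 0$$

in $L^2$. Since

$$\sum_{k=1}^{2^n} \frac{1}{2} \partial_{x_i x_j} f(B_{t_{k-1}}, t_{k-1}) [(\delta_n B_{t_k})_i (\delta_n B_{t_k})_j - \delta_{ij} 2^{-n}]$$

goes to zero in $L^2$ (applying (iii) for $g = \partial_{x_i x_j} f$ and noting that $(\delta_n B_{t_k})_i$ and $(\delta_n B_{t_k})_j$ are independent for $i \ne j$), we have therefore

$$II_n \to \frac{1}{2} \int_0^1 \Delta f(B_s, s)\, ds$$

in $L^2$.

(v) A Taylor expansion with respect to $t$

$$f(x, t) - f(x, s) = \dot{f}(x, s)(t - s) + O((t - s)^2)$$

gives

$$III_n \to \int_0^1 \dot{f}(B_s, s)\, ds$$

in $L^1$ because $s \to \dot{f}(B_s, s)$ is continuous and $III_n$ is a Riemann sum approximation. □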

Example. Consider the function

$$f(x, t) = e^{\alpha x - \alpha^2 t/2} .$$

Because this function satisfies the heat equation $\dot{f} + f''/2 = 0$, we get from Ito's formula

$$f(B_t, t) - f(B_0, 0) = \alpha \int_0^t f(B_s, s) \cdot dB_s .$$

We see that for functions satisfying the heat equation $\dot{f} + f''/2 = 0$ Ito's formula reduces to the usual rule of calculus. If we make a power expansion in $\alpha$ of

$$\int_0^t e^{\alpha B_s - \alpha^2 s/2}\, dB = \frac{1}{\alpha} e^{\alpha B_t - \alpha^2 t/2} - \frac{1}{\alpha}$$

we get other formulas like

$$\int_0^t B_s\, dB = \frac{1}{2} (B_t^2 - t) .$$
Wick ordering.

There is a notation used in quantum field theory developed by Gian-Carlo Wick at about the same time as Ito invented the integral. This Wick ordering is a map on polynomials $\sum_{i=1}^n a_i x^i$ which leaves the class of monic polynomials (polynomials of the form $x^n + a_{n-1} x^{n-1} + \cdots$) invariant.



Definition. Let 



fl n {x) 



be the $n$-th eigenfunction of the quantum mechanical oscillator. Define

and extend the definition to all polynomials by linearity. The polynomials $:x^n:$ are orthogonal with respect to the measure $\Omega_0^2\, dy = \pi^{-1/2} e^{-y^2}\, dy$ because we have seen that the eigenfunctions $\Omega_n$ are orthonormal.



Example. Here are the first Wick powers:

$$\begin{aligned}
:x: &= x \\
:x^2: &= x^2 - 1 \\
:x^3: &= x^3 - 3x \\
:x^4: &= x^4 - 6x^2 + 3 \\
:x^5: &= x^5 - 10x^3 + 15x .
\end{aligned}$$
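The list of Wick powers can be generated mechanically. The sketch below assumes that the polynomials in this example are the monic Hermite polynomials determined by the three-term recurrence $p_{n+1}(x) = x\, p_n(x) - n\, p_{n-1}(x)$ with $p_0 = 1$, $p_1 = x$, which reproduces the fifth power $x^5 - 10x^3 + 15x$ shown above; the function name and the coefficient-list representation are illustrative.

```python
def wick_powers(n_max):
    """Return coefficient lists (lowest degree first) of the monic polynomials
    p_0, ..., p_{n_max} defined by p_{n+1} = x p_n - n p_{n-1}."""
    polys = [[1], [0, 1]]
    for n in range(1, n_max):
        shifted = [0] + polys[n]                                  # x * p_n
        prev = polys[n - 1] + [0] * (len(shifted) - len(polys[n - 1]))
        polys.append([a - n * b for a, b in zip(shifted, prev)])  # minus n * p_{n-1}
    return polys[: n_max + 1]

if __name__ == "__main__":
    for n, coeffs in enumerate(wick_powers(5)):
        print(n, coeffs)
```

Each polynomial has only even or only odd powers and leading coefficient $1$, so the map $x^n \mapsto\, :x^n:$ changes a monomial by terms of lower degree only.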



Definition. The multiplication operator $Q : f \mapsto xf$ is called the position operator. By definition of the creation and annihilation operators one has $Q = \frac{1}{\sqrt{2}} (A + A^*)$.

The following formula indicates, why Wick ordering has its name and why 
it is useful in quantum mechanics: 



Proposition 4.16.7. As operators, we have the identity

$$:Q^n: = \frac{1}{2^{n/2}} \sum_{j=0}^n \binom{n}{j} (A^*)^j A^{n-j} .$$

Definition. Define $L = \sum_{j=0}^n \binom{n}{j} (A^*)^j A^{n-j}$.
Proof. Since we know that the $\Omega_n$ form a basis in $L^2$, we have only to verify that $:Q^n: \Omega_k = 2^{-n/2} L \Omega_k$ for all $k$. From

$$\begin{aligned}
2^{1/2} [Q, L] &= \Big[ A + A^*, \sum_{j=0}^n \binom{n}{j} (A^*)^j A^{n-j} \Big] \\
&= \sum_{j=0}^n \binom{n}{j} \big( j (A^*)^{j-1} A^{n-j} - (n - j) (A^*)^j A^{n-j-1} \big) \\
&= 0
\end{aligned}$$

we obtain by linearity $[H_k(\sqrt{2} Q), L] = 0$. Because $:Q^n: \Omega_0 = 2^{-n/2} (n!)^{1/2} \Omega_n = 2^{-n/2} (A^*)^n \Omega_0 = 2^{-n/2} L \Omega_0$, we get

$$\begin{aligned}
0 &= (:Q^n: - 2^{-n/2} L) \Omega_0 \\
&= (k!)^{-1/2} H_k(\sqrt{2} Q) (:Q^n: - 2^{-n/2} L) \Omega_0 \\
&= (:Q^n: - 2^{-n/2} L) (k!)^{-1/2} H_k(\sqrt{2} Q) \Omega_0 \\
&= (:Q^n: - 2^{-n/2} L) \Omega_k .
\end{aligned}$$

□

□ 

Remark. The new ordering makes the operators $A, A^*$ behave as if $A, A^*$ would commute, even though they don't: they satisfy the commutation relation $[A, A^*] = 1$.

The fact that stochastic integration is relevant to quantum mechanics can 
be seen from the following formula for the Ito integral: 



Theorem 4.16.8 (Ito Integral of $B^n$). Wick ordering makes the Ito integral behave like an ordinary integral.

$$\int_0^t :B_s^n:\, dB_s = \frac{1}{n+1} :B_t^{n+1}: .$$



Remark. Notation can be important to make a concept appear natural. Another example where an adaptation of notation helps is quantum calculus, "calculus without taking limits" [45], where the derivative is defined as $D_q f(x) = d_q f(x)/d_q(x)$ with $d_q f(x) = f(qx) - f(x)$. One can see that $D_q x^n = [n] x^{n-1}$, where $[n] = \frac{q^n - 1}{q - 1}$. The limit $q \to 1$ corresponds to the classical limit case $\hbar \to 0$ of quantum mechanics.
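The rule $D_q x^n = [n] x^{n-1}$ is easy to verify numerically. The following is a minimal sketch of the $q$-derivative as defined in the remark, with $d_q(x) = qx - x$; the function names are illustrative.

```python
def q_derivative(f, x, q):
    """D_q f(x) = (f(qx) - f(x)) / (qx - x), defined for q != 1 and x != 0."""
    return (f(q * x) - f(x)) / (q * x - x)

def q_number(n, q):
    """[n] = (q^n - 1) / (q - 1), which tends to n as q -> 1."""
    return (q ** n - 1.0) / (q - 1.0)

if __name__ == "__main__":
    x, q, n = 2.0, 1.5, 3
    print(q_derivative(lambda y: y ** n, x, q), q_number(n, q) * x ** (n - 1))
    # As q approaches 1, the q-derivative approaches the ordinary derivative n x^(n-1).
    print(q_derivative(lambda y: y ** n, x, 1.000001), n * x ** (n - 1))
```

No limit is taken in the definition, so both sides of the rule agree exactly (up to rounding) for every admissible $q$, and the ordinary derivative appears only as $q \to 1$.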

Proof. By rescaling, we can assume that $t = 1$.

We prove all these equalities simultaneously by showing

$$\int_0^1 :e^{\alpha B_s}:\, dB = \alpha^{-1} :e^{\alpha B_1}: - \alpha^{-1} .$$
The generating function for the Hermite polynomials is known to be



n 



n=0 

(We can check this formula by multiplying it with $\Omega_0$, replacing $x$ with $x/\sqrt{2}$ so that we have



E fl n (x)g n 
(n!)V2 

n=0 y ' 



tt 2 x 2 
2 2 



If we apply $A^*$ on both sides, the equation goes onto itself and we get after $k$ such applications of $A^*$ that the inner product with $\Omega_k$ is the same on both sides. Therefore the functions must be the same.)
This means 



E 



3=0 



Since the right hand side satisfies $\dot{f} + f''/2 = 0$, the claim follows from the Ito formula for such functions. □

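We can now determine all the integrals $\int_0^t B_s^n\, dB$. The Wick powers of $B_s$ are taken with respect to the variance $s$ of $B_s$, so that $:B_s^2: = B_s^2 - s$ and $:B_s^3: = B_s^3 - 3 s B_s$:

$$\int_0^t 1\, dB = B_t$$

$$\int_0^t B_s\, dB = \frac{1}{2} :B_t^2: = \frac{1}{2} (B_t^2 - t)$$

$$\int_0^t B_s^2\, dB = \int_0^t (:B_s^2: + s)\, dB = \frac{1}{3} :B_t^3: + \int_0^t s\, dB = \frac{1}{3} (B_t^3 - 3 t B_t) + \int_0^t s\, dB$$

and so on.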

Stochastic integrals for the oscillator and the Brownian bridge process.

Let $Q_t = e^{-t} B_{e^{2t}}/\sqrt{2}$ be the oscillator process and $A_t = (1 - t) B_{t/(1-t)}$ the Brownian bridge. If we define new discrete differentials

$$\delta_n Q_{t_k} = Q_{t_{k+1}} - e^{-(t_{k+1} - t_k)} Q_{t_k}$$

$$\delta_n A_{t_k} = A_{t_{k+1}} - A_{t_k} + \frac{t_{k+1} - t_k}{1 - t_k} A_{t_k}$$

the stochastic integrals can be defined as in the case of Brownian motion as a limit of discrete integrals.



Feynman-Kac formula for Schrödinger operators with magnetic fields.

Stochastic integrals appear in the Feynman-Kac formula for particles moving in a magnetic field. Let $A(x)$ be a vector potential in $\mathbb{R}^3$ which gives the magnetic field $B(x) = \mathrm{curl}(A)$. Quantum mechanically, a particle moving in a magnetic field together with an external field is described by the Hamiltonian

$$H = (i\nabla + A)^2 + V .$$

In the case $A = 0$, we get the usual Schrödinger operator. The Feynman-Kac formula is the Wiener integral

$$e^{-tH} u(0) = \int e^{-F(B, t)} u(B_t)\, dB ,$$

where $F(B, t)$ is a stochastic integral:

$$F(B, t) = i \int_0^t A(B_s) \cdot dB + \frac{i}{2} \int_0^t \mathrm{div}(A)(B_s)\, ds + \int_0^t V(B_s)\, ds .$$

4.17 Processes of bounded quadratic variation 

We develop now the stochastic Ito integral with respect to general martingales. Brownian motion $B$ will be replaced by a martingale $M$ which is assumed to be in $L^2$. The aim will be to define an integral

$$\int_0^t K_s\, dM_s ,$$

where $K$ is a progressively measurable process which satisfies some boundedness condition.

Definition. Given a right-continuous function $f : [0, \infty) \to \mathbb{R}$. For each finite subdivision

$$\Delta = \{ 0 = t_0, t_1, \dots, t_n = t \}$$

of the interval $[0, t]$ we define $|\Delta| = \sup_{i} |t_{i+1} - t_i|$ called the modulus of $\Delta$. Define

$$\| f \|_\Delta = \sum_{i=0}^{n-1} |f_{t_{i+1}} - f_{t_i}| .$$

A function with finite total variation $\| f \|_t = \sup_\Delta \| f \|_\Delta < \infty$ is called a function of finite variation. If $\sup_t \| f \|_t < \infty$, then $f$ is called of bounded variation. One abbreviates bounded variation with BV.

Example. Differentiable $C^1$ functions are of finite variation. Note that for functions of finite variation, the total variation $\| f \|_t$ can go to $\infty$ for $t \to \infty$, but if $\| f \|_t$ stays bounded, we have a function of bounded variation. Monotone and bounded functions are of finite variation. Sums of functions of bounded variation are of bounded variation.

Remark. Every function of finite variation can be written as $f = f^+ - f^-$, where $f^\pm$ are both positive and increasing. Proof: define $f_t^\pm = (\pm f_t + \| f \|_t)/2$.
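The decomposition $f^\pm = (\pm f + \|f\|)/2$ can be tried out on sampled data. The sketch below works with the values of a function on a fixed grid and uses the variation along that grid in place of the supremum over all subdivisions, which is an assumption of this illustration (the grid variation is only a lower bound for $\|f\|_t$); the function names are illustrative.

```python
import math

def running_variation(values):
    """Running sums of |f_{t_{i+1}} - f_{t_i}| along the grid, starting at 0."""
    running = [0.0]
    for a, b in zip(values, values[1:]):
        running.append(running[-1] + abs(b - a))
    return running

def jordan_decomposition(values):
    """Return (f_plus, f_minus) with f_pm = (+-f + V)/2, V the running variation."""
    v = running_variation(values)
    f_plus = [(vi + fi) / 2.0 for fi, vi in zip(values, v)]
    f_minus = [(vi - fi) / 2.0 for fi, vi in zip(values, v)]
    return f_plus, f_minus

if __name__ == "__main__":
    grid = [k / 200.0 for k in range(201)]
    values = [math.sin(6.0 * s) for s in grid]
    f_plus, f_minus = jordan_decomposition(values)
    print("variation along the grid:", running_variation(values)[-1])
    print("max |f - (f+ - f-)|:", max(abs(f - (p - m)) for f, p, m in zip(values, f_plus, f_minus)))
```

Both parts are increasing because their increments are $(|\Delta f| \pm \Delta f)/2 \ge 0$, and their difference reproduces the sampled function.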
Remark. Functions of bounded variation are in one to one correspondence to Borel measures on $[0, \infty)$ by the Stieltjes integral $\int_0^t |df| = f_t^+ + f_t^-$.

Definition. A process $X_t$ is called increasing if the paths $X_t(\omega)$ are finite, right-continuous and increasing for almost all $\omega \in \Omega$. A process $X_t$ is called of finite variation, if the paths $X_t(\omega)$ are finite, right-continuous and of finite variation for almost all $\omega \in \Omega$.

Remark. Every bounded variation process $A$ can be written as $A_t = A_t^+ - A_t^-$, where $A_t^\pm$ are increasing. The process $V_t = \int_0^t |dA|_s = A_t^+ + A_t^-$ is increasing and we get for almost all $\omega \in \Omega$ a measure called the variation of $A$.

If $X_t$ is a bounded $\mathcal{A}_t$-adapted process and $A$ is a process of bounded variation, we can form the Lebesgue-Stieltjes integral

$$(X \cdot A)_t(\omega) = \int_0^t X_s(\omega)\, dA_s(\omega) .$$

We would like to define such an integral for martingales. The problem is: 



Proposition 4.17.1. A continuous martingale M is never of finite variation, 
unless it is constant. 



Proof. Assume $M$ is of finite variation. We show that it is constant.

(i) We can assume without loss of generality that $M$ is of bounded variation.

Proof. Otherwise, we can look at the martingale $M^{S_n}$, where $S_n$ is the stopping time $S_n = \inf\{ s \mid V_s \ge n \}$ and $V_t$ is the variation of $M$ on $[0, t]$.

(ii) We can also assume without loss of generality that $M_0 = 0$.

(iii) Let $\Delta = \{ t_0 = 0, t_1, \dots, t_n = t \}$ be a subdivision of $[0, t]$. Since $M$ is a martingale, we have by Pythagoras

$$\begin{aligned}
E[M_t^2] &= E\Big[ \sum_{i=0}^{n-1} (M_{t_{i+1}}^2 - M_{t_i}^2) \Big] \\
&= E\Big[ \sum_{i=0}^{n-1} (M_{t_{i+1}} - M_{t_i})(M_{t_{i+1}} + M_{t_i}) \Big] \\
&= E\Big[ \sum_{i=0}^{n-1} (M_{t_{i+1}} - M_{t_i})^2 \Big]
\end{aligned}$$

and so

$$E[M_t^2] \le E[V_t (\sup_i |M_{t_{i+1}} - M_{t_i}|)] \le K \cdot E[\sup_i |M_{t_{i+1}} - M_{t_i}|] ,$$

where $K$ is a bound for the variation $V_t$. If the modulus $|\Delta|$ goes to zero, then the right hand side goes to zero since $M$ is continuous. Therefore $M = 0$. □

Remark. This proposition applies especially for Brownian motion and underlines the fact that the stochastic integral could not be defined pointwise by a Lebesgue-Stieltjes integral.
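The contrast between first and second order variation is visible on a simulated Brownian path. The sketch below samples one path on a fine dyadic grid of $[0,1]$ and evaluates $\sum |B_{t_{i+1}} - B_{t_i}|$ and $\sum (B_{t_{i+1}} - B_{t_i})^2$ along coarser dyadic subdivisions of the same path; grid sizes and seed are arbitrary choices for this illustration.

```python
import math
import random

def dyadic_variations(level_max=14, seed=3):
    """Return a list of (level, first_variation, quadratic_variation) for one
    simulated Brownian path, evaluated on the dyadic grids with 2^level steps."""
    rng = random.Random(seed)
    n = 2 ** level_max
    sd = math.sqrt(1.0 / n)
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, sd))
    results = []
    for level in range(4, level_max + 1, 2):
        step = n // 2 ** level
        pts = path[::step]  # the same path, seen on a coarser subdivision
        first = sum(abs(b - a) for a, b in zip(pts, pts[1:]))
        quad = sum((b - a) ** 2 for a, b in zip(pts, pts[1:]))
        results.append((level, first, quad))
    return results

if __name__ == "__main__":
    for level, first, quad in dyadic_variations():
        print(level, round(first, 3), round(quad, 3))
```

Refining the subdivision can only increase the first sum (triangle inequality), and for Brownian motion it grows roughly like $2^{\text{level}/2}$, while the sum of squares settles near $t = 1$.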



Definition. If $\Delta = \{ t_0 = 0 < t_1 < \dots \}$ is a subdivision of $\mathbb{R}^+ = [0, \infty)$ with only finitely many points $\{ t_0, t_1, \dots, t_k \}$ in each interval $[0, t]$, we define for a process $X$

$$T_t^\Delta = T_t^\Delta(X) = \Big( \sum_{i=0}^{k-1} (X_{t_{i+1}} - X_{t_i})^2 \Big) + (X_t - X_{t_k})^2 .$$

The process $X$ is called of finite quadratic variation, if there exists a process $\langle X, X \rangle$ such that for each $t$, the random variable $T_t^\Delta$ converges in probability to $\langle X, X \rangle_t$ as $|\Delta| \to 0$.



Theorem 4.17.2 (Doob-Meyer decomposition). Given a continuous and bounded martingale $M$ of finite quadratic variation. Then $\langle M, M \rangle$ is the unique continuous increasing adapted process vanishing at zero such that $M^2 - \langle M, M \rangle$ is a martingale.

Remark. Before we enter the not so easy proof given in [86], let us mention the corresponding result in the discrete case (see theorem (3.5.1)), where $M^2$ was a submartingale so that $M^2$ could be written uniquely as a sum of a martingale and an increasing previsible process.

Proof. Uniqueness follows from the previous proposition: if there were two such continuous and increasing processes $A, B$, then $A - B$ would be a continuous martingale with bounded variation (if $A$ and $B$ are increasing they are of bounded variation) which vanishes at zero. Therefore $A = B$.

(i) $M_t^2 - T_t^\Delta(M)$ is a continuous martingale.

Proof. For $t_i < s < t_{i+1}$, we have from the martingale property, using that the increments $M_{t_{i+1}} - M_s$ and $M_s - M_{t_i}$ are orthogonal,

$$E[(M_{t_{i+1}} - M_{t_i})^2 \mid \mathcal{A}_s] = E[(M_{t_{i+1}} - M_s)^2 \mid \mathcal{A}_s] + (M_s - M_{t_i})^2 .$$

This implies with $0 = t_0 < t_1 < \cdots < t_l < s < t_{l+1} < \cdots < t_k < t$ and using orthogonality

$$\begin{aligned}
E[T_t^\Delta(M) - T_s^\Delta(M) \mid \mathcal{A}_s] &= E\Big[ \sum_{j=l}^{k-1} (M_{t_{j+1}} - M_{t_j})^2 \mid \mathcal{A}_s \Big] \\
&+ E[(M_t - M_{t_k})^2 \mid \mathcal{A}_s] - (M_s - M_{t_l})^2 \\
&= E[(M_t - M_s)^2 \mid \mathcal{A}_s] = E[M_t^2 - M_s^2 \mid \mathcal{A}_s] .
\end{aligned}$$

This implies that $M_t^2 - T_t^\Delta(M)$ is a continuous martingale.

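(ii) Let $C$ be a constant such that $|M| \le C$ in $[0, a]$. Then $E[T_a^\Delta] \le 4C^2$, independent of the subdivision $\Delta = \{ t_0, \dots, t_n \}$ of $[0, a]$.

Proof. The previous computation in (i) gives for $s = 0$, using $T_0^\Delta(M) = 0$,

$$E[T_t^\Delta(M) \mid \mathcal{A}_0] = E[M_t^2 - M_0^2 \mid \mathcal{A}_0] \le E[(M_t - M_0)(M_t + M_0)] \le 4C^2 .$$

(iii) For any subdivision $\Delta$, one has $E[(T_a^\Delta)^2] \le 48 C^4$.

Proof. We can assume $t_n = a$. Then

$$\begin{aligned}
(T_a^\Delta(M))^2 &= \Big( \sum_{k=1}^n (M_{t_k} - M_{t_{k-1}})^2 \Big)^2 \\
&= 2 \sum_{k=1}^n (T_a^\Delta - T_{t_k}^\Delta)(T_{t_k}^\Delta - T_{t_{k-1}}^\Delta) + \sum_{k=1}^n (M_{t_k} - M_{t_{k-1}})^4 .
\end{aligned}$$

From (i), we have

$$E[T_a^\Delta - T_{t_k}^\Delta \mid \mathcal{A}_{t_k}] = E[(M_a - M_{t_k})^2 \mid \mathcal{A}_{t_k}]$$

and consequently, using (ii)

$$\begin{aligned}
E[(T_a^\Delta)^2] &= 2 \sum_{k=1}^n E[(M_a - M_{t_k})^2 (T_{t_k}^\Delta - T_{t_{k-1}}^\Delta)] + \sum_{k=1}^n E[(M_{t_k} - M_{t_{k-1}})^4] \\
&\le E[(2 \sup_k |M_a - M_{t_k}|^2 + \sup_k |M_{t_k} - M_{t_{k-1}}|^2)\, T_a^\Delta] \\
&\le 12 C^2 E[T_a^\Delta] \le 48 C^4 .
\end{aligned}$$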

(iii) For fixed $a > 0$ and subdivisions $\Delta_n$ of $[0, a]$ satisfying $|\Delta_n| \to 0$, the sequence $T_a^{\Delta_n}$ has a limit in $L^2$.

Proof. Given two subdivisions $\Delta', \Delta''$ of $[0, a]$, let $\Delta$ be the subdivision obtained by taking the union of the points of $\Delta'$ and $\Delta''$. By (i), the process $X = T^{\Delta'} - T^{\Delta''}$ is a martingale and by (i) again, applied to the martingale $X$ instead of $M$ we have, using $(x + y)^2 \le 2(x^2 + y^2)$

$$E[X_a^2] = E[(T_a^{\Delta'} - T_a^{\Delta''})^2] = E[T_a^\Delta(X)] \le 2 (E[T_a^\Delta(T^{\Delta'})] + E[T_a^\Delta(T^{\Delta''})]) .$$

We have therefore only to show that $E[T_a^\Delta(T^{\Delta'})] \to 0$ for $|\Delta'| + |\Delta''| \to 0$. Let $s_k$ be in $\Delta$ and $t_m$ the rightmost point in $\Delta'$ such that $t_m \le s_k < s_{k+1} \le t_{m+1}$. We have

$$\begin{aligned}
T_{s_{k+1}}^{\Delta'} - T_{s_k}^{\Delta'} &= (M_{s_{k+1}} - M_{t_m})^2 - (M_{s_k} - M_{t_m})^2 \\
&= (M_{s_{k+1}} - M_{s_k})(M_{s_{k+1}} + M_{s_k} - 2 M_{t_m})
\end{aligned}$$

and so

$$T_a^\Delta(T^{\Delta'}) \le (\sup_k |M_{s_{k+1}} + M_{s_k} - 2 M_{t_m}|^2)\, T_a^\Delta .$$

By the Cauchy-Schwarz inequality

$$E[T_a^\Delta(T^{\Delta'})] \le E[\sup_k |M_{s_{k+1}} + M_{s_k} - 2 M_{t_m}|^4]^{1/2}\, E[(T_a^\Delta)^2]^{1/2}$$

and the first factor goes to $0$ as $|\Delta| \to 0$ and the second factor is bounded because of (iii).

(iv) There exists a sequence of $\Delta_n \subset \Delta_{n+1}$ such that $T_t^{\Delta_n}(M)$ converges
uniformly to a limit $\langle M, M \rangle$ on $[0,a]$.

Proof. Doob's inequality applied to the discrete time martingale $T^{\Delta_n} - T^{\Delta_m}$
gives

$$ E\big[\sup_{t \leq a} |T_t^{\Delta_n} - T_t^{\Delta_m}|^2\big] \leq 4 E[(T_a^{\Delta_n} - T_a^{\Delta_m})^2] . $$

Choosing the sequence $\Delta_n$ such that $\Delta_{n+1}$ is a refinement of $\Delta_n$ and such
that $\bigcup_n \Delta_n$ is dense in $[0,a]$, we can achieve that the convergence is
uniform. The limit $\langle M, M \rangle$ is therefore continuous.

(v) $\langle M, M \rangle$ is increasing.

Proof. Take $\Delta_n \subset \Delta_{n+1}$. For any pair $s < t$ in $\bigcup_n \Delta_n$, we have $T_s^{\Delta_n}(M) \leq
T_t^{\Delta_n}(M)$ if $n$ is so large that $\Delta_n$ contains both $s$ and $t$. Therefore $\langle M, M \rangle$
is increasing on $\bigcup_n \Delta_n$, which can be chosen to be dense. The continuity
of $M$ implies that $\langle M, M \rangle$ is increasing everywhere. □

Remark. The assumption of boundedness for the martingales is not essential.
The result holds for general martingales and even more generally for so-called
local martingales, stochastic processes $X$ for which there exists a sequence
of bounded stopping times $T_n$ increasing to $\infty$ for which the stopped
processes $X^{T_n}$ are martingales.
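
The convergence of the sums $T_t^{\Delta}(M)$ can be watched numerically. The following is a minimal sketch, assuming that $M$ is Brownian motion sampled on an equally spaced grid, so that the limit should be close to $t$; the function names `brownian_path` and `quadratic_variation` are illustrative and not part of the text.

```python
import math
import random


def brownian_path(n_steps, t_end, rng):
    """Sample Brownian motion at n_steps + 1 equally spaced times in [0, t_end]."""
    dt = t_end / n_steps
    path = [0.0]
    for _ in range(n_steps):
        # Independent Gaussian increments with variance dt.
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def quadratic_variation(path):
    """Sum of squared increments: T^Delta for the grid carrying the path."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 10000):
        # The printed values should approach t_end = 1 as the grid gets finer.
        print(n, quadratic_variation(brownian_path(n, 1.0, rng)))
```

For a grid with $n$ points on $[0,t]$ the sum has mean $t$ and standard deviation $t\sqrt{2/n}$, so the fluctuation around $t$ shrinks as the subdivision is refined.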



Corollary 4.17.3. Let $M, N$ be two continuous martingales with the same
filtration. There exists a unique continuous adapted process $\langle M, N \rangle$ of finite
variation which is vanishing at zero and such that

$$ MN - \langle M, N \rangle $$

is a martingale.



Proof. Uniqueness follows again from the fact that a finite variation
martingale must be zero. To get existence, use the parallelogram law

$$ \langle M, N \rangle = \frac{1}{4}\big(\langle M + N, M + N \rangle - \langle M - N, M - N \rangle\big) . $$

This is vanishing at zero and of finite variation since it is a sum of two
processes with this property.

We know that $M^2 - \langle M, M \rangle$, $N^2 - \langle N, N \rangle$ and so that $(M \pm N)^2 - \langle M \pm
N, M \pm N \rangle$ are martingales. Therefore

$$ \big[(M + N)^2 - \langle M + N, M + N \rangle\big] - \big[(M - N)^2 - \langle M - N, M - N \rangle\big] $$
$$ = 4MN - \langle M + N, M + N \rangle + \langle M - N, M - N \rangle $$

and $MN - \langle M, N \rangle$ is a martingale. □

Definition. The process $\langle M, N \rangle$ is called the bracket of $M$ and $N$ and
$\langle M, M \rangle$ the increasing process of $M$.

Example. If $B = (B^{(1)}, \dots, B^{(d)})$ is Brownian motion, then $\langle B^{(i)}, B^{(j)} \rangle_t =
\delta_{ij} t$ as we have computed in the proof of the Ito formula in the case $t = 1$.
It can be shown that every martingale $M$ which has the property that

$$ \langle M^{(i)}, M^{(j)} \rangle_t = \delta_{ij} t $$

must be Brownian motion. This is Levy's characterization of Brownian
motion.

Remark. If $M$ is a martingale vanishing at zero and $\langle M, M \rangle = 0$, then
$M = 0$. Since $M_t^2 - \langle M, M \rangle_t$ is a martingale vanishing at zero, we have

$$ E[M_t^2] = E[\langle M, M \rangle_t] . $$

Remark. Since we have got $\langle M, M \rangle$ as a limit of processes $T_t^{\Delta}$, we could
also write $\langle M, N \rangle$ as such a limit.

4.18 The Ito integral for martingales

In the last section, we have defined for two continuous martingales $M, N$
the bracket process $\langle M, N \rangle$. Because $\langle M, M \rangle$ was increasing, it was of
finite variation and therefore also $\langle M, N \rangle$ is of finite variation. It defines a
random measure $d\langle M, N \rangle$.

Theorem 4.18.1 (Kunita-Watanabe inequality). Let $M, N$ be two continuous
martingales and $H, K$ two measurable processes. Then for all $p, q \geq 1$
satisfying $1/p + 1/q = 1$, we have for all $t \leq \infty$

$$ E\Big[\int_0^t |H_s|\,|K_s|\,|d\langle M, N \rangle_s|\Big] \leq \Big\| \Big(\int_0^t H_s^2\, d\langle M, M \rangle_s\Big)^{1/2} \Big\|_p \cdot \Big\| \Big(\int_0^t K_s^2\, d\langle N, N \rangle_s\Big)^{1/2} \Big\|_q . $$
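Proof. (i) Define $\langle M, N \rangle_s^t = \langle M, N \rangle_t - \langle M, N \rangle_s$. Claim: almost surely

$$ |\langle M, N \rangle_s^t| \leq (\langle M, M \rangle_s^t)^{1/2} (\langle N, N \rangle_s^t)^{1/2} . $$

Proof. For fixed $r$, the random variable

$$ \langle M, M \rangle_s^t + 2r \langle M, N \rangle_s^t + r^2 \langle N, N \rangle_s^t = \langle M + rN, M + rN \rangle_s^t $$

is nonnegative almost everywhere and this stays true simultaneously for a dense
set of $r \in \mathbb{R}$. Since $M, N$ are continuous, it holds for all $r$. The claim follows,
since $a + 2rb + cr^2 \geq 0$ for all real $r$ with nonnegative $a, c$ implies $|b| \leq \sqrt{a}\sqrt{c}$.

(ii) To prove the claim, it is, using Holder's inequality, enough to show
that almost everywhere, the inequality

$$ \int_0^t |H_s|\,|K_s|\, d|\langle M, N \rangle|_s \leq \Big(\int_0^t H_s^2\, d\langle M, M \rangle\Big)^{1/2} \cdot \Big(\int_0^t K_s^2\, d\langle N, N \rangle\Big)^{1/2} $$

holds. By taking limits, it is enough to prove this for $t < \infty$ and bounded
$K, H$. By a density argument, we can also assume that both $K$ and $H$ are
step functions $H = \sum_{i=1}^n H_i 1_{J_i}$ and $K = \sum_{i=1}^n K_i 1_{J_i}$, where $J_i = [t_i, t_{i+1})$.

(iii) We get from (i) for step functions $H, K$ as in (ii)

$$ \Big|\int_0^t H_s K_s\, d\langle M, N \rangle_s\Big| \leq \sum_i |H_i K_i|\, |\langle M, N \rangle_{t_i}^{t_{i+1}}| $$
$$ \leq \sum_i |H_i K_i|\, (\langle M, M \rangle_{t_i}^{t_{i+1}})^{1/2} (\langle N, N \rangle_{t_i}^{t_{i+1}})^{1/2} $$
$$ \leq \Big(\sum_i H_i^2 \langle M, M \rangle_{t_i}^{t_{i+1}}\Big)^{1/2} \Big(\sum_i K_i^2 \langle N, N \rangle_{t_i}^{t_{i+1}}\Big)^{1/2} $$
$$ = \Big(\int_0^t H^2\, d\langle M, M \rangle\Big)^{1/2} \cdot \Big(\int_0^t K^2\, d\langle N, N \rangle\Big)^{1/2} , $$

where we have used the Cauchy-Schwarz inequality for the summation over
$i$. □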

Definition. Denote by $\mathcal{H}^2$ the set of $L^2$-martingales which are $\mathcal{A}_t$-adapted
and satisfy

$$ \|M\|_{\mathcal{H}^2} = \big(\sup_t E[M_t^2]\big)^{1/2} < \infty . $$

Call $H^2$ the subset of continuous martingales in $\mathcal{H}^2$ and denote by $H_0^2$ the subset
of continuous martingales which are vanishing at zero.
Given a martingale $M \in H^2$, we define $L^2(M)$, the space of progressively
measurable processes $K$ such that

$$ \|K\|_{L^2(M)}^2 = E\Big[\int_0^\infty K_s^2\, d\langle M, M \rangle_s\Big] < \infty . $$

Both $\mathcal{H}^2$ and $L^2(M)$ are Hilbert spaces.

Lemma 4.18.2. The space $H^2$ of continuous $L^2$ martingales is closed in $\mathcal{H}^2$
and so a Hilbert space. Also $H_0^2$ is closed in $H^2$ and is therefore a Hilbert
space.

Proof. Take a sequence $M^{(n)}$ in $H^2$ converging to $M \in \mathcal{H}^2$. By Doob's
inequality

$$ E\big[\big(\sup_t |M_t^{(n)} - M_t|\big)^2\big] \leq 4\, \|M^{(n)} - M\|_{\mathcal{H}^2}^2 . $$

We can extract a subsequence for which $\sup_t |M_t^{(n_k)} - M_t|$ converges
pointwise to zero almost everywhere. Therefore $M \in H^2$. The same argument
shows also that $H_0^2$ is closed. □



Proposition 4.18.3. Given $M \in H^2$ and $K \in L^2(M)$. There exists a unique
element $\int_0^t K\,dM \in H_0^2$ such that

$$ \Big\langle \int_0^t K\,dM, N \Big\rangle = \int_0^t K\, d\langle M, N \rangle $$

for every $N \in H^2$. The map $K \mapsto \int_0^t K\,dM$ is an isometry from $L^2(M)$ to
$H_0^2$.

Proof. We can assume $M \in H_0^2$ since in general, we define $\int_0^t K\,dM =
\int_0^t K\,d(M - M_0)$.

(i) By the Kunita-Watanabe inequality, we have for every $N \in H_0^2$

$$ \Big| E\Big[\int_0^t K_s\, d\langle M, N \rangle_s\Big] \Big| \leq \|N\|_{H^2} \cdot \|K\|_{L^2(M)} . $$

The map

$$ N \mapsto E\Big[\int_0^t K_s\, d\langle M, N \rangle_s\Big] $$

is therefore a linear continuous functional on the Hilbert space $H_0^2$. By the
Riesz representation theorem, there is an element $\int K\,dM \in H_0^2$ such that

$$ E\Big[\Big(\int_0^t K_s\,dM_s\Big) N_t\Big] = E\Big[\int_0^t K_s\, d\langle M, N \rangle_s\Big] $$

for every $N \in H_0^2$.

(ii) Uniqueness. Assume there exist two martingales $L, L' \in H_0^2$ such that
$\langle L, N \rangle = \langle L', N \rangle$ for all $N \in H_0^2$. Then, in particular, $\langle L - L', L - L' \rangle = 0$,
from which $L = L'$ follows.

(iii) The integral $K \mapsto \int_0^t K\,dM$ is an isometry because

$$ \Big\| \int K\,dM \Big\|_{H_0^2}^2 = E\Big[\Big(\int_0^\infty K_s\,dM_s\Big)^2\Big] = E\Big[\int_0^\infty K_s^2\, d\langle M, M \rangle\Big] = \|K\|_{L^2(M)}^2 . $$

□

Definition. The martingale $\int_0^t K_s\,dM_s$ is called the Ito integral of the
progressively measurable process $K$ with respect to the martingale $M$. We
can take especially $K = f(M)$, since continuous processes are progressively
measurable. If we take $M = B$, Brownian motion, we get the already
familiar Ito integral.

Definition. An $\mathcal{A}_t$-adapted right-continuous process $X$ is called a local
martingale if there exists a sequence $T_n$ of increasing stopping times with $T_n \to \infty$
almost everywhere, such that for every $n$, the process $X^{T_n} 1_{\{T_n > 0\}}$ is a
uniformly integrable $\mathcal{A}_t$-martingale. Local martingales are more general than
martingales. Stochastic integration can be defined more generally for local
martingales.

We show now that Ito's formula holds also for general martingales. First, 
a special case, the integration by parts formula. 



Theorem 4.18.4 (Integration by parts). Let $X, Y$ be two continuous
martingales. Then

$$ X_t Y_t - X_0 Y_0 = \int_0^t X_s\,dY_s + \int_0^t Y_s\,dX_s + \langle X, Y \rangle_t $$

and especially

$$ X_t^2 - X_0^2 = 2 \int_0^t X_s\,dX_s + \langle X, X \rangle_t . $$

Proof. The general case follows from the special case by polarization: use
the special case for $X \pm Y$ as well as $X$ and $Y$.

The special case is proved by discretisation: let $\Delta = \{t_0, t_1, \dots, t_n\}$ be a
finite discretisation of $[0,t]$ with $t_0 = 0$ and $t_n = t$. Then

$$ \sum_{i=0}^{n-1} (X_{t_{i+1}} - X_{t_i})^2 = X_t^2 - X_0^2 - 2 \sum_{i=0}^{n-1} X_{t_i} (X_{t_{i+1}} - X_{t_i}) . $$

Letting $|\Delta|$ go to zero, we get the claim. □
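
The discretised identity used in this proof is purely algebraic and holds for every finite sequence of values, which can be checked directly. The sketch below is only an illustration; the helper names `ito_sum`, `squared_increments` and `check_identity` are made up for this example.

```python
def ito_sum(xs):
    """Left-endpoint sum: sum_i x_i * (x_{i+1} - x_i)."""
    return sum(a * (b - a) for a, b in zip(xs, xs[1:]))


def squared_increments(xs):
    """Sum of squared increments: sum_i (x_{i+1} - x_i)^2."""
    return sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))


def check_identity(xs):
    """Return both sides of
    sum (x_{i+1} - x_i)^2 = x_n^2 - x_0^2 - 2 sum x_i (x_{i+1} - x_i)."""
    lhs = squared_increments(xs)
    rhs = xs[-1] ** 2 - xs[0] ** 2 - 2 * ito_sum(xs)
    return lhs, rhs


if __name__ == "__main__":
    print(check_identity([1, 3, 2, 5]))  # both entries agree
```

The left-endpoint evaluation in `ito_sum` is what makes the sum a discrete Ito integral; evaluating at midpoints instead would remove the squared-increment term.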



Theorem 4.18.5 (Ito formula for martingales). Given vector martingales
$M = (M^{(1)}, \dots, M^{(d)})$ and $X$ and a function $f \in C^2(\mathbb{R}^d, \mathbb{R})$. Then

$$ f(X_t) - f(X_0) = \int_0^t \nabla f(X_s) \cdot dM_s + \frac{1}{2} \sum_{i,j} \int_0^t \partial_{x_i} \partial_{x_j} f(X_s)\, d\langle M^{(i)}, M^{(j)} \rangle_s . $$

Proof. It is enough to prove the formula for polynomials. By the integration
by parts formula, we get the result for functions $f(x) = x_i g(x)$, if it is
established for a function $g$. Since it is true for constant functions, we are
done by induction. □

Remark. The usual Ito formula in one dimension is a special case: in one
dimension and if $M_t = B_t$ is Brownian motion and $X_t$ is a martingale, we have

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s)\,dB_s + \frac{1}{2} \int_0^t f''(X_s)\,ds . $$

We will use it later, when dealing with stochastic differential
equations. It is a special case, because $\langle B_t, B_t \rangle = t$, so that $d\langle B_t, B_t \rangle = dt$.

Example. If $f(x) = x^2$, this formula gives for processes satisfying $X_0 = 0$

$$ X_t^2/2 = \int_0^t X_s\,dB_s + t/2 . $$

This formula integrates the stochastic integral $\int_0^t X_s\,dB_s = X_t^2/2 - t/2$.

Example. If $f(x) = \log(x)$, the formula gives

$$ \log(X_t/X_0) = \int_0^t dB_s/X_s - \frac{1}{2} \int_0^t ds/X_s^2 . $$

4.19 Stochastic differential equations

We have seen earlier that if $B_t$ is Brownian motion, then $X = f(B,t) =
e^{\alpha B_t - \alpha^2 t/2}$ is a martingale. In the last section we learned using Ito's formula
and $\frac{1}{2} \Delta f + \dot{f} = 0$ that

$$ \int_0^t \alpha X_s\,dM_s = X_t - 1 . $$

We can write this in differential form as

$$ dX_t = \alpha X_t\,dM_t, \quad X_0 = 1 . $$

This is an example of a stochastic differential equation (SDE) and one
would use the notation

$$ \frac{dX}{dM} = \alpha X $$

if it would not lead to confusion with the corresponding ordinary differential
equation, where $M$ is not a stochastic process but a variable and where the
solution would be $X = e^{\alpha M}$. Here, the solution is the stochastic process
$X_t = e^{\alpha B_t - \alpha^2 t/2}$.

Definition. Let $B_t$ be Brownian motion in $\mathbb{R}^d$. A solution of a stochastic
differential equation

$$ dX_t = f(X_t, B_t) \cdot dB_t + g(X_t)\,dt $$

is an $\mathbb{R}^d$-valued process $X_t$ satisfying

$$ X_t = \int_0^t f(X_s, B_s) \cdot dB_s + \int_0^t g(X_s)\,ds , $$

where $f$ and $g$ are the given coefficient functions.

As for ordinary differential equations, where one can easily solve separable
differential equations $dx/dt = f(x) g(t)$ by integration, this works for
stochastic differential equations. However, to integrate, one has to use an
adapted substitution. The key is Ito's formula (4.18.5) which holds for
martingales and so for solutions of stochastic differential equations. In
one dimension it reads

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s)\,dX_s + \frac{1}{2} \int_0^t f''(X_s)\,d\langle X_s, X_s \rangle . $$

The following "multiplication table" for the product $\langle \cdot, \cdot \rangle$ and the
differentials $dt, dB_t$ can be found in many books of stochastic differential equations
[2, 47, 68] and is useful to have in mind when solving actual stochastic
differential equations:

            |  dt   |  dB_t
    --------+-------+-------
    dt      |  0    |  0
    dB_t    |  0    |  dt

Example. The linear ordinary differential equation $dX/dt = rX$ with
solution $X_t = e^{rt} X_0$ has a stochastic analog. It is called the stochastic
population model. We look for a stochastic process $X_t$ which solves the SDE

$$ \frac{dX_t}{dt} = r X_t + \alpha X_t \zeta_t . $$

Separation of variables gives

$$ \frac{dX}{X} = r\,dt + \alpha \zeta\,dt $$

and integration with respect to $t$

$$ \int_0^t \frac{dX_s}{X_s} = rt + \alpha B_t . $$

In order to compute the stochastic integral on the left hand side, we have to
do a change of variables with $f(X) = \log(X)$. Looking up the multiplication
table:

$$ \langle dX_t, dX_t \rangle = \langle r X_t\,dt + \alpha X_t\,dB_t,\; r X_t\,dt + \alpha X_t\,dB_t \rangle = \alpha^2 X_t^2\,dt . $$

Ito's formula in one dimension

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s)\,dX_s + \frac{1}{2} \int_0^t f''(X_s)\,d\langle X_s, X_s \rangle $$

gives therefore

$$ \log(X_t/X_0) = \int_0^t dX_s/X_s - \frac{1}{2} \int_0^t \alpha^2\,ds $$

so that $\int_0^t dX_s/X_s = \alpha^2 t/2 + \log(X_t/X_0)$. Therefore,

$$ \alpha^2 t/2 + \log(X_t/X_0) = rt + \alpha B_t $$

and so $X_t = X_0 e^{rt - \alpha^2 t/2 + \alpha B_t}$. This process is called geometric Brownian
motion. We see especially that $\dot{X} = X/2 + X\zeta$ with $X_0 = 1$ has the solution $X_t = e^{B_t}$.
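
The closed form can be compared with a direct time-stepping of the SDE. The following is a minimal sketch, assuming the Euler-Maruyama scheme (which the text does not use) and an equally spaced grid; `exact_gbm` and `euler_maruyama_gbm` are illustrative names. Both values are computed from the same simulated increments of $B$, so they should be close for small step size.

```python
import math
import random


def exact_gbm(x0, r, alpha, t, b_t):
    """Closed form X_t = X_0 exp(r t - alpha^2 t / 2 + alpha B_t)."""
    return x0 * math.exp(r * t - alpha * alpha * t / 2 + alpha * b_t)


def euler_maruyama_gbm(x0, r, alpha, t_end, n_steps, rng):
    """Step dX = r X dt + alpha X dB on an equal grid.

    Returns (X at t_end, B at t_end) so that the exact formula can be
    evaluated on the same Brownian path.
    """
    dt = t_end / n_steps
    x, b = x0, 0.0
    for _ in range(n_steps):
        db = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment
        x += r * x * dt + alpha * x * db
        b += db
    return x, b


if __name__ == "__main__":
    x, b = euler_maruyama_gbm(1.0, 0.5, 0.5, 1.0, 10000, random.Random(3))
    print(x, exact_gbm(1.0, 0.5, 0.5, 1.0, b))
```

Dropping the correction term $-\alpha^2 t/2$ in `exact_gbm` produces a visible mismatch with the stepped solution, which is the numerical trace of the Ito term $\frac{1}{2} \int f''\, d\langle X, X \rangle$.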




Figure. Solutions to the stochastic population model for $r > 0$.

Figure. Solutions to the stochastic population model for $r < 0$.



Remark. The stochastic population model is also important when modeling
financial markets. In that area the constant $r$ is called the percentage drift
or expected gain and $\alpha$ is called the percentage volatility. The Black-Scholes
model makes the assumption that stock prices evolve according to
geometric Brownian motion.
Example. In principle, one can study stochastic versions of any differential
equation. An example from physics is when a particle moves in a possibly
time-dependent force field $F(x,t)$ with friction $b$ for which the equation
without noise is

$$ \ddot{x} = -b \dot{x} + F(x,t) . $$

If we add white noise, we get a stochastic differential equation

$$ \ddot{x} = -b \dot{x} + F(x,t) + \alpha \zeta(t) . $$

For example, with $X = \dot{x}$ and $F = 0$, the velocity $v(t) = X_t$ satisfies the
stochastic differential equation

$$ \frac{dX_t}{dt} = -b X_t + \alpha \zeta_t , $$

which has the solution

$$ X_t = e^{-bt} X_0 + \alpha \int_0^t e^{-b(t-s)}\,dB_s . $$

With a time dependent force $F(x,t)$, already the differential equation
without noise can not be given closed solutions in general. If the friction constant
$b$ is noisy, we obtain

$$ \frac{dX_t}{dt} = (-b + \alpha \zeta(t)) X_t , $$

which is the stochastic population model treated in the previous example.

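Example. Here is a list of stochastic differential equations with solutions.
We again use the notation of white noise $\zeta(t) = dB_t/dt$ which is a generalized
function in the following table. The notational replacement $dB_t = \zeta_t\,dt$ is
quite popular for more applied sciences like engineering or finance.

    Stochastic differential equation             | Solution
    ---------------------------------------------+--------------------------------------------------
    $\frac{d}{dt} X_t = 1 \zeta(t)$               | $X_t = B_t$
    $\frac{d}{dt} X_t = B_t \zeta(t)$             | $X_t = :B_t^2:/2 = (B_t^2 - 1)/2$
    $\frac{d}{dt} X_t = :B_t^2: \zeta(t)$         | $X_t = :B_t^3:/3 = (B_t^3 - 3B_t)/3$
    $\frac{d}{dt} X_t = :B_t^3: \zeta(t)$         | $X_t = :B_t^4:/4 = (B_t^4 - 6B_t^2 + 3)/4$
    $\frac{d}{dt} X_t = :B_t^4: \zeta(t)$         | $X_t = :B_t^5:/5 = (B_t^5 - 10B_t^3 + 15B_t)/5$
    $\frac{d}{dt} X_t = \alpha X_t \zeta(t)$      | $X_t = e^{\alpha B_t - \alpha^2 t/2}$
    $\frac{d}{dt} X_t = r X_t + \alpha X_t \zeta(t)$ | $X_t = e^{rt + \alpha B_t - \alpha^2 t/2}$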

Remark. Because the Ito integral can be defined for any continuous
martingale, Brownian motion could be replaced by another continuous martingale
$M$, leading to other classes of stochastic differential equations. A solution
must then satisfy

$$ X_t = \int_0^t f(X_s, M_s, s) \cdot dM_s + \int_0^t g(X_s, s)\,ds . $$

Example.

$$ X_t = e^{\alpha M_t - \alpha^2 \langle M, M \rangle_t / 2} $$

is a solution of $dX_t = \alpha X_t\,dM_t$, $X_0 = 1$.
Remark. Stochastic differential equations were introduced by Ito in 1951.
Differential equations with a different integral came from Stratonovich but
there are formulas relating them with each other. So, it is enough
to consider the Ito integral. Both versions of stochastic integration have
advantages and disadvantages. Kunita shows in his book [56] that one can
view solutions as stochastic flows of diffeomorphisms. This brings the topic
into the framework of ergodic theory.

For ordinary differential equations $\dot{x} = f(x,t)$, one knows that unique
solutions exist locally if $f$ is Lipschitz continuous in $x$ and continuous in $t$. The
proof given for 1-dimensional systems generalizes to differential equations
in arbitrary Banach spaces. The idea of the proof is a Picard iteration of
an operator which is a contraction. Below, we give a detailed proof of this
existence theorem for ordinary differential equations. For stochastic
differential equations, one can do the same. We will do such an iteration on the
Hilbert space $H^2_{[0,T]}$ of $L^2$ martingales $X$ having finite norm

$$ \|X\|_T = E\big[\sup_{t \leq T} X_t^2\big] . $$

We will need the following version of Doob's inequality:

Lemma 4.19.1. Let $X$ be an $L^p$ martingale with $p > 1$. Then

$$ E\big[\sup_{s \leq t} |X_s|^p\big] \leq \Big(\frac{p}{p-1}\Big)^p \cdot E[|X_t|^p] . $$

Proof. We can assume without loss of generality that $X$ is bounded. The
general result follows by approximating $X$ by $X \wedge k$ with $k \to \infty$.
Define $X^* = \sup_{s \leq t} |X_s|$. From Doob's inequality

$$ \lambda\, P[X^* \geq \lambda] \leq E[|X_t| \cdot 1_{\{X^* \geq \lambda\}}] $$

we get

$$ E[|X^*|^p] = E\Big[\int_0^{X^*} p \lambda^{p-1}\,d\lambda\Big] $$
$$ = E\Big[\int_0^\infty p \lambda^{p-1} 1_{\{X^* \geq \lambda\}}\,d\lambda\Big] $$
$$ = \int_0^\infty p \lambda^{p-1} P[X^* \geq \lambda]\,d\lambda $$
$$ \leq \int_0^\infty p \lambda^{p-2} E[|X_t| \cdot 1_{\{X^* \geq \lambda\}}]\,d\lambda $$
$$ = p\, E\Big[|X_t| \int_0^{X^*} \lambda^{p-2}\,d\lambda\Big] $$
$$ = \frac{p}{p-1} E[|X_t| \cdot (X^*)^{p-1}] . $$

Holder's inequality gives

$$ E[|X^*|^p] \leq \frac{p}{p-1}\, E[(X^*)^p]^{(p-1)/p}\, E[|X_t|^p]^{1/p} $$

and the claim follows. □



Theorem 4.19.2 (Local existence and uniqueness of solutions). Let $M$ be a
continuous martingale. Assume $f(x,t)$ and $g(x,t)$ are continuous in $t$ and
Lipschitz continuous in $x$. Then there exists $T > 0$ and a unique solution
$X_t$ on $[0,T]$ of the SDE

$$ dX = f(x,t)\,dM + g(x,t)\,ds $$

with a given initial condition $X_0$.



Proof. Define the operator

$$ S(X)_t = \int_0^t f(s, X_s)\,dM_s + \int_0^t g(s, X_s)\,ds $$

on $L^2$-processes. Write $S(X) = S_1(X) + S_2(X)$. We will show that on some
time interval $(0,T]$, the map $S$ is a contraction and that $S^n(X)$ converges
in the metric $|||X - Y|||_T = E[\sup_{s \leq T} (X_s - Y_s)^2]$, if $T$ is small enough, to
a unique fixed point. It is enough that for $i = 1, 2$

$$ |||S_i(X) - S_i(Y)|||_T \leq (1/4) \cdot |||X - Y|||_T ; $$

then $S$ is a contraction

$$ |||S(X) - S(Y)|||_T \leq (1/2) \cdot |||X - Y|||_T . $$

By assumption, there exists a constant $K$ such that

$$ |f(t,w) - f(t,w')| \leq K \cdot \sup_{s \leq t} |w_s - w'_s| . $$

(i) $|||S_1(X) - S_1(Y)|||_T = |||\int_0^t f(s, X_s) - f(s, Y_s)\,dM_s|||_T \leq (1/4) \cdot |||X - Y|||_T$
for $T$ small enough.

Proof. By the above lemma for $p = 2$, we have

$$ |||S_1(X) - S_1(Y)|||_T = E\Big[\sup_{t \leq T} \Big(\int_0^t f(s,X) - f(s,Y)\,dM_s\Big)^2\Big] $$
$$ \leq 4 E\Big[\Big(\int_0^T f(t,X) - f(t,Y)\,dM_t\Big)^2\Big] $$
$$ = 4 E\Big[\int_0^T (f(t,X) - f(t,Y))^2\,d\langle M, M \rangle_t\Big] $$
$$ \leq 4 K^2 E\Big[\int_0^T \sup_{s \leq t} |X_s - Y_s|^2\,dt\Big] $$
$$ = 4 K^2 \int_0^T |||X - Y|||_s\,ds $$
$$ \leq (1/4) \cdot |||X - Y|||_T , $$

where the last inequality holds for $T$ small enough.

(ii) $|||S_2(X) - S_2(Y)|||_T = |||\int_0^t g(s, X_s) - g(s, Y_s)\,ds|||_T \leq (1/4) \cdot |||X - Y|||_T$
for $T$ small enough. This is proved as for differential equations in Banach
spaces.

The two estimates (i) and (ii) prove the claim in the same way as in the
classical Cauchy-Picard existence theorem. □

Appendix. In this Appendix, we add the existence of solutions of ordinary
differential equations in Banach spaces. Let $\mathcal{X}$ be a Banach space and $I$ an
interval in $\mathbb{R}$. The following lemma is useful for proving existence of fixed
points of maps.

Lemma 4.19.3. Let $X = \overline{B_r(x_0)} \subset \mathcal{X}$ and assume $\phi$ is a differentiable map
$X \to \mathcal{X}$. If for all $x \in X$, $\|D\phi(x)\| \leq |\lambda| < 1$ and

$$ \|\phi(x_0) - x_0\| \leq (1 - \lambda) \cdot r $$

then $\phi$ has exactly one fixed point in $X$.

Proof. The condition $\|x - x_0\| \leq r$ implies that

$$ \|\phi(x) - x_0\| \leq \|\phi(x) - \phi(x_0)\| + \|\phi(x_0) - x_0\| \leq \lambda r + (1 - \lambda) r = r . $$

The map $\phi$ maps therefore the ball $X$ into itself. Banach's fixed point
theorem applied to the complete metric space $X$ and the contraction $\phi$
implies the result. □

Let $f$ be a map from $I \times \mathcal{X}$ to $\mathcal{X}$. A differentiable map $u : J \to \mathcal{X}$ defined on an
open interval $J \subset I$ is called a solution of the differential equation

$$ \dot{x} = f(t,x) $$

if we have for all $t \in J$ the relation

$$ \dot{u}(t) = f(t, u(t)) . $$

Theorem 4.19.4 (Cauchy-Picard existence theorem). Let $f : I \times \mathcal{X} \to \mathcal{X}$
be continuous in the first coordinate and locally Lipschitz continuous in the
second. Then, for every $(t_0, x_0) \in I \times \mathcal{X}$, there exists an open interval $J \subset I$
with midpoint $t_0$, such that on $J$, there exists exactly one solution of the
differential equation $\dot{x} = f(t,x)$ with $x(t_0) = x_0$.



Proof. There exists an interval $J(t_0, a) = (t_0 - a, t_0 + a) \subset I$ and a ball
$B(x_0, b)$, such that

$$ M = \sup\{ \|f(t,x)\| \;|\; (t,x) \in J(t_0, a) \times B(x_0, b) \} $$

as well as

$$ k = \sup\Big\{ \frac{\|f(t,x_1) - f(t,x_2)\|}{\|x_1 - x_2\|} \;\Big|\; (t,x_1), (t,x_2) \in J(t_0, a) \times B(x_0, b),\; x_1 \neq x_2 \Big\} $$

are finite. Define for $r < a$ the Banach space

$$ \mathcal{X}^r = C(J(t_0, r), \mathcal{X}) = \{ y : J(t_0, r) \to \mathcal{X},\; y \text{ continuous} \} $$

with norm

$$ \|y\| = \sup_{t \in J(t_0, r)} \|y(t)\| . $$

Let $V_{r,b}$ be the open ball in $\mathcal{X}^r$ with radius $b$ around the constant map
$t \mapsto x_0$. For every $y \in V_{r,b}$ we define

$$ \phi(y) : t \mapsto x_0 + \int_{t_0}^t f(s, y(s))\,ds $$

which is again an element in $\mathcal{X}^r$. We prove now that for $r$ small enough,
$\phi$ is a contraction. A fixed point of $\phi$ is then a solution of the differential
equation $\dot{x} = f(t,x)$, which exists on $J = J_r(t_0)$. For two points $y_1, y_2 \in V_{r,b}$,
we have by assumption

$$ \|f(s, y_1(s)) - f(s, y_2(s))\| \leq k \cdot \|y_1(s) - y_2(s)\| \leq k \cdot \|y_1 - y_2\| $$

for every $s \in J_r$. Thus, we have

$$ \|\phi(y_1) - \phi(y_2)\| = \Big\| \int_{t_0}^t f(s, y_1(s)) - f(s, y_2(s))\,ds \Big\| $$
$$ \leq \int_{t_0}^t \|f(s, y_1(s)) - f(s, y_2(s))\|\,ds $$
$$ \leq kr \cdot \|y_1 - y_2\| . $$
On the other hand, we have for every $s \in J_r$

$$ \|f(s, x_0(s))\| \leq M $$

and so

$$ \|\phi(x_0) - x_0\| = \sup_{t \in J_r} \Big\| \int_{t_0}^t f(s, x_0(s))\,ds \Big\| \leq Mr . $$

We can apply the above lemma, if $kr < 1$ and $Mr < b(1 - kr)$. This is
the case, if $r < b/(M + kb)$. By choosing $r$ small enough, we can get the
contraction rate as small as we wish. □

Definition. A set $X$ with a distance function $d(x,y)$ for which the following
properties

(i) $d(y,x) = d(x,y) \geq 0$ for all $x, y \in X$.

(ii) $d(x,x) = 0$ and $d(x,y) > 0$ for $x \neq y$.

(iii) $d(x,z) \leq d(x,y) + d(y,z)$ for all $x, y, z$.

hold is called a metric space.

Example. The plane $\mathbb{R}^2$ with the usual distance $d(x,y) = |x - y|$. Another
metric is the Manhattan or taxi metric $d(x,y) = |x_1 - y_1| + |x_2 - y_2|$.

Example. The set $C([0,1])$ of all continuous functions $x(t)$ on the interval
$[0,1]$ with the distance $d(x,y) = \max_t |x(t) - y(t)|$ is a metric space.

Definition. A map $\phi : X \to X$ is called a contraction, if there exists $\lambda < 1$
such that $d(\phi(x), \phi(y)) \leq \lambda \cdot d(x,y)$ for all $x, y \in X$. The map $\phi$ shrinks the
distance of any two points by the contraction factor $\lambda$.

Example. The map $\phi(x) = \frac{1}{2} x + (1, 0)$ is a contraction on $\mathbb{R}^2$.

Example. The map $\phi(x)(t) = \sin(t) x(t) + t$ is a contraction on $C([0,1])$
because $|\phi(x)(t) - \phi(y)(t)| = |\sin(t)| \cdot |x(t) - y(t)| \leq \sin(1) \cdot |x(t) - y(t)|$.

Definition. A Cauchy sequence in a metric space $(X, d)$ is defined to be a
sequence which has the property that for any $\epsilon > 0$, there exists $n_0$ such
that $d(x_n, x_m) \leq \epsilon$ for $n \geq n_0$, $m \geq n_0$.

A metric space in which every Cauchy sequence converges to a limit is
called complete.

Example. The n-dimensional Euclidean space

$$ (\mathbb{R}^n, d(x,y) = |x - y|) $$

is complete. The set of rational numbers with the usual distance

$$ (\mathbb{Q}, d(x,y) = |x - y|) $$

is not complete.
Example. The space $C[0,1]$ is complete: given a Cauchy sequence $x_n$, then
$x_n(t)$ is a Cauchy sequence in $\mathbb{R}$ for all $t$. Therefore $x_n(t)$ converges
pointwise to a function $x(t)$. This function is continuous: take $\epsilon > 0$, then
$|x(t) - x(s)| \leq |x(t) - x_n(t)| + |x_n(t) - x_n(s)| + |x_n(s) - x(s)|$ by the triangle
inequality. If $s$ is close to $t$, the second term is smaller than $\epsilon/3$. For large
$n$, $|x(t) - x_n(t)| \leq \epsilon/3$ and $|x_n(s) - x(s)| \leq \epsilon/3$. So, $|x(t) - x(s)| \leq \epsilon$ if
$|t - s|$ is small.

Theorem 4.19.5 (Banach's fixed point theorem). A contraction $\phi$ in a
complete metric space $(X, d)$ has exactly one fixed point in $X$.

Proof. (i) We first show by induction that

$$ d(\phi^n(x), \phi^n(y)) \leq \lambda^n \cdot d(x,y) $$

for all $n$.

(ii) Using the triangle inequality and $\sum_k \lambda^k = (1 - \lambda)^{-1}$, we get for all
$x \in X$,

$$ d(x, \phi^n x) \leq \sum_{k=0}^{n-1} d(\phi^k x, \phi^{k+1} x) \leq \sum_{k=0}^{n-1} \lambda^k d(x, \phi(x)) \leq \frac{1}{1 - \lambda} \cdot d(x, \phi(x)) . $$

(iii) For all $x \in X$ the sequence $x_n = \phi^n(x)$ is a Cauchy sequence because
by (i), (ii),

$$ d(x_n, x_{n+k}) \leq \lambda^n \cdot d(x, x_k) \leq \lambda^n \cdot \frac{1}{1 - \lambda} \cdot d(x, x_1) . $$

By completeness of $X$ it has a limit $\tilde{x}$ which is a fixed point of $\phi$.

(iv) There is only one fixed point. Assume there were two fixed points $\tilde{x}, \tilde{y}$
of $\phi$. Then

$$ d(\tilde{x}, \tilde{y}) = d(\phi(\tilde{x}), \phi(\tilde{y})) \leq \lambda \cdot d(\tilde{x}, \tilde{y}) . $$

This is impossible unless $\tilde{x} = \tilde{y}$. □
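
Both fixed point arguments of this appendix can be tried out by plain iteration. The sketch below makes two assumptions: the planar contraction is read as $\phi(x) = x/2 + (1,0)$, and the Picard map is written out for the specific equation $\dot{x} = x$, $x(0) = 1$ on polynomials, which is an illustration and not an example taken from the text. All function names are made up for this example.

```python
from fractions import Fraction


def iterate(phi, x, n):
    """Apply the map phi n times, starting from x."""
    for _ in range(n):
        x = phi(x)
    return x


def phi_plane(v):
    """The contraction v -> v/2 + (1, 0) on R^2; its fixed point is (2, 0)."""
    return (v[0] / 2 + 1, v[1] / 2)


def picard_step(coeffs):
    """Picard map y -> 1 + integral_0^t y(s) ds for x' = x, x(0) = 1.

    A polynomial is stored as a list where coeffs[k] is the coefficient of t^k.
    """
    # Integrate term by term: c * t^k becomes c / (k + 1) * t^(k + 1).
    result = [Fraction(0)] + [c / (k + 1) for k, c in enumerate(coeffs)]
    result[0] = Fraction(1)  # add the initial value x(0) = 1
    return result


if __name__ == "__main__":
    print(iterate(phi_plane, (0.0, 8.0), 60))       # close to (2, 0)
    print(iterate(picard_step, [Fraction(1)], 3))   # 1 + t + t^2/2 + t^3/6
```

Starting from the constant function 1, the Picard iterates are the Taylor polynomials of $e^t$, the solution of $\dot{x} = x$ with $x(0) = 1$.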
Chapter 5

Selected Topics



5.1 Percolation

Definition. Let $e_i$ be the standard basis in the lattice $\mathbb{Z}^d$. Denote with $L^d$
the Cayley graph of $\mathbb{Z}^d$ with the generators $A = \{e_1, \dots, e_d\}$. This graph
$L^d = (V, E)$ has the lattice $\mathbb{Z}^d$ as vertices. The edges or bonds in that
graph are straight line segments connecting neighboring points $x, y$, that is, points
satisfying $|x - y| = \sum_{i=1}^d |x_i - y_i| = 1$.

Definition. We declare each bond of $L^d$ to be open with probability $p \in
[0,1]$ and closed otherwise. Bonds are open or closed independently of all
other bonds. The product measure $P_p$ is defined on the probability space
$\Omega = \prod_{e \in E} \{0,1\}$ of all configurations. We denote expectation with respect
to $P_p$ with $E_p[\cdot]$.

Definition. A path in $L^d$ is a sequence of vertices $(x_0, x_1, \dots, x_n)$ such that
$(x_i, x_{i+1}) = e_i$ are bonds of $L^d$. Such a path has length $n$ and connects $x_0$
with $x_n$. A path is called open if all its edges are open and closed if all its
edges are closed. Two subgraphs of $L^d$ are disjoint if they have no edges
and no vertices in common.

Definition. Consider the random subgraph of $L^d$ containing the vertex set
$\mathbb{Z}^d$ and only open edges. The connected components of this graph are called
open clusters. If it is finite, an open cluster is also called a lattice animal.
Call $C(x)$ the open cluster containing the vertex $x$. By translation
invariance, the distribution of $C(x)$ is independent of $x$ and we can take $x = 0$
for which we write $C(0) = C$.
Figure. A lattice animal.

Definition. Define the percolation probability $\theta(p)$ as the probability
that a given vertex belongs to an infinite open cluster:

$$ \theta(p) = P_p[|C| = \infty] = 1 - \sum_{n=1}^{\infty} P_p[|C| = n] . $$

One of the goals of bond percolation theory is to study the function $\theta(p)$.



Lemma 5.1.1. There exists a critical value $p_c = p_c(d)$ such that $\theta(p) = 0$
for $p < p_c$ and $\theta(p) > 0$ for $p > p_c$. The value $d \mapsto p_c(d)$ is non-increasing
with respect to the dimension: $p_c(d + 1) \leq p_c(d)$.

Proof. The function $p \mapsto \theta(p)$ is non-decreasing and $\theta(0) = 0$, $\theta(1) = 1$. We
can therefore define

$$ p_c = \inf\{ p \in [0,1] \;|\; \theta(p) > 0 \} . $$

The graph $\mathbb{Z}^d$ can be embedded into the graph $\mathbb{Z}^{d'}$ for $d < d'$ by realizing $\mathbb{Z}^d$
as a linear subspace of $\mathbb{Z}^{d'}$ parallel to a coordinate plane. Any configuration
in $L^{d'}$ projects then to a configuration in $L^d$. If the origin is in an infinite
cluster of $\mathbb{Z}^d$, then it is also in an infinite cluster of $\mathbb{Z}^{d'}$. □

Remark. The one-dimensional case d = 1 is not interesting because p c = 1 
there. Interesting phenomena are only possible in dimensions d > 1. The 
planar case d = 2 is already very interesting. 
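
The behavior of $\theta(p)$ in the planar case can be explored by simulation. The following is a rough sketch, assuming a finite-box proxy: instead of the event that the cluster of the origin is infinite, it tests whether the open cluster of the origin reaches the boundary of the box $[-R, R]^2$. The function names and the lazy sampling of bonds are choices made for this illustration.

```python
import random
from collections import deque


def origin_reaches_boundary(p, radius, rng):
    """Breadth-first search through open bonds of Z^2, starting at the origin.

    Each bond is declared open with probability p the first time it is
    examined. Returns True if the cluster reaches the box boundary.
    """
    bond_open = {}

    def is_open(a, b):
        key = (a, b) if a < b else (b, a)  # undirected bond
        if key not in bond_open:
            bond_open[key] = rng.random() < p
        return bond_open[key]

    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        if max(abs(x), abs(y)) == radius:
            return True
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in seen and is_open((x, y), nxt):
                seen.add(nxt)
                queue.append(nxt)
    return False


def estimate_reach_probability(p, radius, trials, seed=0):
    """Fraction of independent samples in which the origin reaches the boundary."""
    rng = random.Random(seed)
    hits = sum(origin_reaches_boundary(p, radius, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        print(p, estimate_reach_probability(p, 30, 200))
```

The reach probability for a fixed box is an upper bound for $\theta(p)$ and decreases to it as the radius grows, so small boxes overestimate the percolation probability near the critical value.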

Definition. A self-avoiding random walk in $L^d$ is the process $S_T$ obtained
by stopping the ordinary random walk $S_n$ with stopping time

$$ T(\omega) = \inf\{ n \in \mathbb{N} \;|\; \omega(n) = \omega(m),\; m < n \} . $$

Let $\sigma(n)$ be the number of self-avoiding paths in $L^d$ which have length $n$.
The connective constant of $L^d$ is defined as

$$ \lambda(d) = \lim_{n \to \infty} \sigma(n)^{1/n} . $$

Remark. The exact value of $\lambda(d)$ is not known. But one has the elementary
estimate $d < \lambda(d) < 2d - 1$ because a self-avoiding walk can not reverse
direction and so $\sigma(n) \leq 2d(2d - 1)^{n-1}$ and a walk going only forward
in each direction is self-avoiding. For example, it is known that $\lambda(2) \in
[2.62002, 2.69576]$ and numerical estimates make one believe that the real
value is 2.6381585. The number $c_n$ of self-avoiding walks of length $n$ in $L^2$ is
for small values $c_1 = 4$, $c_2 = 12$, $c_3 = 36$, $c_4 = 100$, $c_5 = 284$, $c_6 = 780$, $c_7 =
2172, \dots$. Consult [64] for more information on the self-avoiding random
walk.
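
The small values of $c_n$ quoted in this remark can be reproduced by brute-force enumeration. The following is a minimal sketch for the planar lattice; the function name `count_self_avoiding_walks` is illustrative, and the running time grows exponentially in $n$, so it is only meant for small lengths.

```python
def count_self_avoiding_walks(n):
    """Number of self-avoiding walks of length n in Z^2 starting at the origin."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in steps:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                # Depth-first search with backtracking on the visited set.
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n)


if __name__ == "__main__":
    for n in range(1, 8):
        c = count_self_avoiding_walks(n)
        print(n, c, c ** (1.0 / n))
```

The last printed column is $c_n^{1/n}$, which approaches the connective constant $\lambda(2)$ only slowly, one reason why its numerical determination is delicate.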



Theorem 5.1.2 (Broadbent-Hammersley theorem). If $d > 1$, then

$$ 0 < \lambda(d)^{-1} \leq p_c(d) \leq p_c(2) < 1 . $$

Proof. (i) $p_c(d) \geq \lambda(d)^{-1}$.

Let $N(n) \leq \sigma(n)$ be the number of open self-avoiding paths of length $n$ in
$L^d$. Since any such path is open with probability $p^n$, we have

$$ E_p[N(n)] = p^n \sigma(n) . $$

If the origin is in an infinite open cluster, there must exist open paths of
all lengths beginning at the origin so that

$$ \theta(p) \leq P_p[N(n) \geq 1] \leq E_p[N(n)] = p^n \sigma(n) = (p \lambda(d) + o(1))^n $$

which goes to zero for $p < \lambda(d)^{-1}$. This shows that $p_c(d) \geq \lambda(d)^{-1}$.

(ii) $p_c(2) < 1$.

Denote by $L^2_*$ the dual graph of $L^2$ which has as vertices the faces of $L^2$ and
as edges pairs of faces which are adjacent. We can realize the vertices as
$\mathbb{Z}^2 + (1/2, 1/2)$. Since there is a bijective relation between the edges of $L^2$
and $L^2_*$, we declare an edge of $L^2_*$ to be open if it crosses an open edge
in $L^2$ and closed, if it crosses a closed edge. This defines bond percolation
on $L^2_*$.

The fact that the origin is in the interior of a closed circuit of the dual
lattice if and only if the open cluster at the origin is finite follows from the
Jordan curve theorem which assures that a closed path in the plane divides
the plane into two disjoint subsets.

Let $\rho(n)$ denote the number of closed circuits in the dual which have length
$n$ and which contain in their interiors the origin of $L^2$. Each such circuit
contains a self-avoiding walk of length $n - 1$ starting from a vertex of the
form $(k + 1/2, 1/2)$, where $0 \leq k < n$. Since the number of such paths $\gamma$ is
at most $n \sigma(n - 1)$, we have

$$ \rho(n) \leq n \sigma(n - 1) $$

and with $q = 1 - p$

$$ \sum_{\gamma} P[\gamma \text{ is closed}] \leq \sum_{n=1}^{\infty} q^n n \sigma(n - 1) = \sum_{n=1}^{\infty} q n (q \lambda(2) + o(1))^{n-1} $$

which is finite if $q \lambda(2) < 1$. Furthermore, this sum goes to zero if $q \to 0$ so
that we can find $0 < \delta < 1$ such that for $p > \delta$

$$ \sum_{\gamma} P[\gamma \text{ is closed}] \leq 1/2 . $$

We have therefore

$$ P[|C| = \infty] = P[\text{no } \gamma \text{ is closed}] \geq 1 - \sum_{\gamma} P[\gamma \text{ is closed}] \geq 1/2 $$

so that $p_c(2) \leq \delta < 1$. □

Remark. We will see below that even $p_c(2) \leq 1 - \lambda(2)^{-1}$. It is however
known that $p_c(2) = 1/2$.

Definition. The parameter set p < p c is called the sub-critical phase, the 
set p > p c is the supercritical phase. 

Definition. For $p < p_c$, one is also interested in the mean size of the open
cluster

$$ \chi(p) = E_p[|C|] . $$

For $p > p_c$, one would like to know the mean size of the finite clusters

$$ \chi^f(p) = E_p[\,|C| \;|\; |C| < \infty\,] . $$

It is known that $\chi(p) < \infty$ for $p < p_c$ but only conjectured that $\chi^f(p) < \infty$
for $p > p_c$.

An interesting question is whether there exists an infinite open cluster at the critical
point $p = p_c$. The answer is known to be no in the case $d = 2$ and generally
believed to be no for $d \geq 3$. For $p$ near $p_c$ it is believed that the percolation
probability $\theta(p)$ and the mean size $\chi(p)$ behave as powers of $|p - p_c|$. It is
conjectured that the following critical exponents

$$ \gamma = - \lim_{p \nearrow p_c} \frac{\log \chi(p)}{\log |p - p_c|} $$

$$ \beta = \lim_{p \searrow p_c} \frac{\log \theta(p)}{\log |p - p_c|} $$

$$ \delta^{-1} = - \lim_{n \to \infty} \frac{\log P_{p_c}[|C| \geq n]}{\log n} $$

exist.

Percolation deals with a family of probability spaces $(\Omega, \mathcal{A}, P_p)$, where
$\Omega = \{0,1\}^{L^d}$ is the set of configurations with product $\sigma$-algebra $\mathcal{A}$ and
product measure $P_p = (p, 1 - p)^{L^d}$.
Definition. There exists a natural partial ordering in $\Omega$ coming from the
ordering on $\{0,1\}$: we say $\omega \leq \omega'$, if $\omega(e) \leq \omega'(e)$ for all bonds $e \in L^d$.
We call a random variable $X$ on $(\Omega, \mathcal{A}, P)$ increasing if $\omega \leq \omega'$ implies
$X(\omega) \leq X(\omega')$. It is called decreasing if $-X$ is increasing. As usual, this
notion can also be defined for measurable sets $A \in \mathcal{A}$: a set $A$ is increasing
if $1_A$ is increasing.

Lemma 5.1.3. If $X$ is an increasing random variable in $L^1(\Omega, P_q) \cap L^1(\Omega, P_p)$,
then

$$ E_p[X] \leq E_q[X] $$

if $p \leq q$.

Proof. If $X$ depends only on a single bond $e$, we can write $E_p[X] = p X(1) +
(1 - p) X(0)$. Because $X$ is assumed to be increasing, we have $\frac{d}{dp} E_p[X] =
X(1) - X(0) \geq 0$ which gives $E_p[X] \leq E_q[X]$ for $p \leq q$. If $X$ depends only
on finitely many bonds, we can write it as a sum $X = \sum_{i=1}^n X_i$ of variables
$X_i$ which depend only on one bond and get again

$$ \frac{d}{dp} E_p[X] = \sum_{i=1}^n (X_i(1) - X_i(0)) \geq 0 . $$

In general we approximate every random variable in $L^1(\Omega, P_p) \cap L^1(\Omega, P_q)$
by step functions which depend only on finitely many coordinates $X_i$. Since
$E_p[X_i] \to E_p[X]$ and $E_q[X_i] \to E_q[X]$, the claim follows. □

The following correlation inequality is named after Fortuin, Kasteleyn and
Ginibre (1971).



Theorem 5.1.4 (FKG inequality). For increasing random variables
$X, Y \in \mathcal{L}^2(\Omega, P_p)$, we have

\[ E_p[XY] \ge E_p[X] \cdot E_p[Y] . \]



Proof. As in the proof of the above lemma, we prove the claim first for
random variables $X$ which depend only on $n$ edges $e_1, \dots, e_n$ and proceed
by induction.

(i) The claim, if $X$ and $Y$ only depend on one edge $e$.
We have

\[ (X(\omega) - X(\omega'))(Y(\omega) - Y(\omega')) \ge 0 \]

since the left hand side is $0$ if $\omega(e) = \omega'(e)$; if $1 = \omega(e) > \omega'(e) = 0$, both
factors are nonnegative since $X, Y$ are increasing; if $0 = \omega(e) < \omega'(e) = 1$,
both factors are non-positive since $X, Y$ are increasing. Therefore

\[ 0 \le \sum_{\sigma, \sigma' \in \{0,1\}} (X(\sigma) - X(\sigma'))(Y(\sigma) - Y(\sigma')) \, P_p[\omega(e) = \sigma] \, P_p[\omega'(e) = \sigma'] \]

\[ = 2 (E_p[XY] - E_p[X] E_p[Y]) . \]

(ii) Assume the claim is known for all functions which depend on $k$ edges
with $k < n$. We claim that it holds also for $X, Y$ depending on $n$ edges
$e_1, e_2, \dots, e_n$.

Let $\mathcal{A}_k = \mathcal{A}(e_1, \dots, e_k)$ be the $\sigma$-algebra generated by functions depending
only on the edges $e_1, \dots, e_k$. The random variables

\[ X_k = E_p[X | \mathcal{A}_k], \quad Y_k = E_p[Y | \mathcal{A}_k] \]

depend only on $e_1, \dots, e_k$ and are increasing. By induction,

\[ E_p[X_{n-1} Y_{n-1}] \ge E_p[X_{n-1}] E_p[Y_{n-1}] . \]

By the tower property of conditional expectation, the right hand side is
$E_p[X] E_p[Y]$. For fixed $e_1, \dots, e_{n-1}$, we have $(XY)_{n-1} \ge X_{n-1} Y_{n-1}$ and so

\[ E_p[XY] = E_p[(XY)_{n-1}] \ge E_p[X_{n-1} Y_{n-1}] . \]

(iii) Let $X, Y$ be arbitrary and define $X_n = E_p[X | \mathcal{A}_n]$, $Y_n = E_p[Y | \mathcal{A}_n]$. We
know from (ii) that $E_p[X_n Y_n] \ge E_p[X_n] E_p[Y_n]$. Since $X_n = E[X | \mathcal{A}_n]$ and
$Y_n = E[Y | \mathcal{A}_n]$ are martingales which are bounded in $\mathcal{L}^2(\Omega, P_p)$, Doob's
convergence theorem (3.5.4) implies that $X_n \to X$ and $Y_n \to Y$ in $\mathcal{L}^2$ and
therefore $E[X_n] \to E[X]$ and $E[Y_n] \to E[Y]$. By the Schwarz inequality, we
get, with $\|\cdot\|_1$ and $\|\cdot\|_2$ denoting the $\mathcal{L}^1$ and $\mathcal{L}^2$ norms in $(\Omega, \mathcal{A}, P_p)$,

\[ \|X_n Y_n - XY\|_1 \le \|(X_n - X) Y_n\|_1 + \|X (Y_n - Y)\|_1 \]

\[ \le \|X_n - X\|_2 \|Y_n\|_2 + \|X\|_2 \|Y_n - Y\|_2 \]

\[ \le C (\|X_n - X\|_2 + \|Y_n - Y\|_2) \to 0 \]

where $C = \max(\|X\|_2, \|Y\|_2)$ is a constant. This means
$E_p[X_n Y_n] \to E_p[XY]$. □

Remark. It follows immediately that if $A, B$ are increasing events in $\Omega$,
then $P_p[A \cap B] \ge P_p[A] \cdot P_p[B]$.
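
The correlation inequality for events can be checked by brute force on a small number of edges. The following is only a sketch: the two events `A` and `B` on four edges are hypothetical choices made for illustration, and the expectation is computed by enumerating all configurations in $\{0,1\}^4$.

```python
from itertools import product

def expectation(f, n, p):
    """E_p[f] for f on configurations in {0,1}^n under the product measure."""
    total = 0.0
    for omega in product((0, 1), repeat=n):
        k = sum(omega)  # number of open edges
        total += f(omega) * p**k * (1 - p)**(n - k)
    return total

def is_increasing(f, n):
    """Check f(w) <= f(w') whenever w' is obtained from w by opening one edge."""
    for omega in product((0, 1), repeat=n):
        for i in range(n):
            if omega[i] == 0:
                bigger = omega[:i] + (1,) + omega[i + 1:]
                if f(omega) > f(bigger):
                    return False
    return True

# Two increasing events on 4 edges (hypothetical examples).
A = lambda w: 1 if (w[0] and w[1]) or w[2] else 0   # edges 0,1 both open, or edge 2 open
B = lambda w: 1 if sum(w) >= 2 else 0               # at least two open edges

def fkg_gap(p, n=4):
    """P_p[A and B] - P_p[A] * P_p[B]; nonnegative by the FKG inequality."""
    joint = expectation(lambda w: A(w) * B(w), n, p)
    return joint - expectation(A, n, p) * expectation(B, n, p)
```

Evaluating `fkg_gap(p)` for a few values of `p` in $[0,1]$ gives nonnegative numbers (up to rounding), as the remark predicts.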

Example. Let $\Gamma_i$ be families of paths in $L^d$ and let $A_i$ be the event that
some path in $\Gamma_i$ is open. Then $A_i$ are increasing events and so after applying
the inequality $k$ times, we get

\[ P_p[\bigcap_{i=1}^k A_i] \ge \prod_{i=1}^k P_p[A_i] . \]

We show now how this inequality can be used to give an explicit bound for
the critical percolation probability $p_c$ in $L^2$. The following corollary still
belongs to the theorem of Broadbent-Hammersley.



Corollary 5.1.5.

\[ p_c(2) \le 1 - \lambda(2)^{-1} . \]



Proof. Given any integer $N \in \mathbb{N}$, define the events

\[ F_N = \{ \exists \text{ no closed path of length} \le N \text{ in } L^* \} \]
\[ G_N = \{ \exists \text{ no closed path of length} > N \text{ in } L^* \} . \]

We know that $F_N \cap G_N \subset \{ |C| = \infty \}$. Since $F_N$ and $G_N$ are both increasing,
the correlation inequality says $P_p[F_N \cap G_N] \ge P_p[F_N] \cdot P_p[G_N]$. We
deduce

\[ \theta(p) = P_p[|C| = \infty] \ge P_p[F_N \cap G_N] \ge P_p[F_N] \cdot P_p[G_N] . \]

If $(1-p)\lambda(2) < 1$, then we know that

\[ P_p[G_N^c] \le \sum_{n=N}^{\infty} (1-p)^n \, n \, \sigma(n-1) \]

which goes to zero for $N \to \infty$. For $N$ large enough, we have therefore
$P_p[G_N] \ge 1/2$. Since also $P_p[F_N] > 0$, it follows that $\theta(p) > 0$ if
$(1-p)\lambda(2) < 1$, that is if $p > 1 - \lambda(2)^{-1}$, which proves the claim. □

Definition. Given $A \in \mathcal{A}$ and $\omega \in \Omega$. We say that an edge $e \in L^d$ is pivotal
for the pair $(A, \omega)$ if $1_A(\omega) \ne 1_A(\omega_e)$, where $\omega_e$ is the unique configuration
which agrees with $\omega$ except at the edge $e$.



Theorem 5.1.6 (Russo's formula). Let $A$ be an increasing event depending
only on finitely many edges of $L^d$. Then

\[ \frac{d}{dp} P_p[A] = E_p[N(A)] , \]

where $N(A)$ is the number of edges which are pivotal for $A$.



Proof. (i) We define a new probability space.

The family of probability spaces $(\Omega, \mathcal{A}, P_p)$ can be embedded in one
probability space

\[ ([0,1]^{L^d}, \mathcal{B}([0,1]^{L^d}), P) , \]

where $P$ is the product measure $dx^{L^d}$. Given a configuration $\eta \in [0,1]^{L^d}$ and
$p \in [0,1]$, we get a configuration in $\Omega$ by defining $\eta_p(e) = 1$ if $\eta(e) < p$ and
$\eta_p(e) = 0$ else. More generally, given $p \in [0,1]^{L^d}$, we get configurations
$\eta_p(e) = 1$ if $\eta(e) < p(e)$ and $\eta_p(e) = 0$ else. Like this, we can define
configurations with a large class of probability measures
$P_p = \prod_{e \in L^d} (p(e), 1 - p(e))$ with one probability space and we have

\[ P_p[A] = P[\eta_p \in A] . \]

(ii) Derivative with respect to one $p(f)$.

Assume $p$ and $p'$ differ only at an edge $f$ such that $p(f) \le p'(f)$. Then
$\{\eta_p \in A\} \subset \{\eta_{p'} \in A\}$ so that

\[ P_{p'}[A] - P_p[A] = P[\eta_{p'} \in A] - P[\eta_p \in A] \]
\[ = P[\eta_{p'} \in A; \eta_p \notin A] \]
\[ = (p'(f) - p(f)) \, P_p[f \text{ pivotal for } A] . \]

Divide both sides by $(p'(f) - p(f))$ and let $p'(f) \to p(f)$. This gives

\[ \frac{\partial}{\partial p(f)} P_p[A] = P_p[f \text{ pivotal for } A] . \]



(iii) The claim, if $A$ depends on finitely many edges. If $A$ depends on finitely
many edges, then $P_p[A]$ is a function of a finite set $\{p(f_i)\}_{i=1}^m$ of edge
probabilities. The chain rule gives then

\[ \frac{d}{dp} P_p[A] = \sum_{i=1}^m \frac{\partial}{\partial p(f_i)} P_p[A] \Big|_{p = (p, p, \dots, p)} \]

\[ = \sum_{i=1}^m P_p[f_i \text{ pivotal for } A] \]

\[ = E_p[N(A)] . \]

(iv) The general claim.

In general, define for every finite set $F \subset E$

\[ p_F(e) = p + 1_{\{e \in F\}} \delta , \]

where $0 \le p \le p + \delta \le 1$. Since $A$ is increasing, we have

\[ P_{p+\delta}[A] \ge P_{p_F}[A] \]

and therefore

\[ \frac{1}{\delta} (P_{p+\delta}[A] - P_p[A]) \ge \frac{1}{\delta} (P_{p_F}[A] - P_p[A]) \to \sum_{e \in F} P_p[e \text{ pivotal for } A] \]

as $\delta \to 0$. The claim is obtained by making $F$ larger and larger filling out
$E$. □

Example. Let $F = \{e_1, e_2, \dots, e_m\} \subset E$ be a finite set of edges and let

\[ A = \{ \text{the number of open edges in } F \text{ is} \ge k \} . \]

An edge $e \in F$ is pivotal for $A$ if and only if $F \setminus \{e\}$ has exactly $k - 1$ open
edges. We have

\[ P_p[e \text{ is pivotal}] = \binom{m-1}{k-1} p^{k-1} (1-p)^{m-k} \]

so that by Russo's formula

\[ \frac{d}{dp} P_p[A] = \sum_{e \in F} P_p[e \text{ is pivotal}] = m \binom{m-1}{k-1} p^{k-1} (1-p)^{m-k} . \]

Since we know $P_0[A] = 0$, we obtain by integration

\[ P_p[A] = \sum_{l=k}^{m} \binom{m}{l} p^l (1-p)^{m-l} . \]
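
Russo's formula for this example can be verified numerically. The sketch below compares a central finite difference of the binomial tail $P_p[A]$ with the expected number of pivotal edges; the function names and the step size `h` are choices made here for illustration.

```python
from math import comb

def prob_A(p, m, k):
    """P_p[at least k of m independent edges are open] (binomial tail)."""
    return sum(comb(m, l) * p**l * (1 - p)**(m - l) for l in range(k, m + 1))

def expected_pivotal(p, m, k):
    """E_p[N(A)] = m * C(m-1, k-1) * p^(k-1) * (1-p)^(m-k)."""
    return m * comb(m - 1, k - 1) * p**(k - 1) * (1 - p)**(m - k)

def numerical_derivative(p, m, k, h=1e-6):
    """Central difference approximation of d/dp P_p[A]."""
    return (prob_A(p + h, m, k) - prob_A(p - h, m, k)) / (2 * h)
```

For instance with `m = 7` and `k = 3`, the two quantities `numerical_derivative(p, 7, 3)` and `expected_pivotal(p, 7, 3)` agree to within the finite difference error for every `p` strictly between 0 and 1.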



Remark. If $A$ no longer depends on finitely many edges, then $P_p[A]$
need no longer be differentiable for all values of $p$.

Definition. The mean size of the open cluster is $\chi(p) = E_p[|C|]$.



Theorem 5.1.7 (Uniqueness). For $p < p_c$, the mean size of the open cluster
is finite: $\chi(p) < \infty$.



The proof of this theorem is quite involved and we will not give the full
argument. Let $S(n, x) = \{ y \in \mathbb{Z}^d \;|\; |x - y| = \sum_{i=1}^d |x_i - y_i| \le n \}$ be the ball of
radius $n$ around $x$ in $\mathbb{Z}^d$ and let $A_n$ be the event that there exists an open
path joining the origin with some vertex in $\partial S(n, 0)$.



Lemma 5.1.8. (Exponential decay of radius of the open cluster) If $p < p_c$,
there exists $\psi_p > 0$ such that $P_p[A_n] < e^{-n \psi_p}$.



Proof (of Theorem 5.1.7, assuming the lemma). Clearly, $|S(n,0)| \le C_d \cdot (n+1)^d$
with some constant $C_d$. Let $M = \max\{ n \;|\; A_n \text{ occurs} \}$. By definition of $p_c$,
if $p < p_c$, then $P_p[M < \infty] = 1$. We get

\[ E_p[|C|] \le \sum_n E_p[\, |C| \;|\; M = n \,] \cdot P_p[M = n] \]

\[ \le \sum_n |S(n,0)| \, P_p[A_n] \]

\[ \le \sum_n C_d (n+1)^d e^{-n \psi_p} < \infty . \]
□



Proof (of Lemma 5.1.8). We are concerned with the probabilities
$g_p(n) = P_p[A_n]$. Since $A_n$ are increasing events, Russo's formula gives

\[ g_p'(n) = E_p[N(A_n)] , \]

where $N(A_n)$ is the number of pivotal edges in $A_n$. We have

\[ g_p'(n) = \sum_e P_p[e \text{ pivotal for } A_n] \]

\[ = \sum_e \frac{1}{p} P_p[e \text{ open and pivotal for } A_n] \]

\[ = \sum_e \frac{1}{p} P_p[A_n \cap \{ e \text{ pivotal for } A_n \}] \]

\[ = \sum_e \frac{1}{p} P_p[A_n \cap \{ e \text{ pivotal for } A_n \} \;|\; A_n] \cdot P_p[A_n] \]

\[ = \frac{1}{p} E_p[N(A_n) \;|\; A_n] \cdot g_p(n) \]

so that

\[ \frac{g_p'(n)}{g_p(n)} = \frac{1}{p} E_p[N(A_n) \;|\; A_n] . \]
By integrating up from $\alpha$ to $\beta$, we get

\[ g_\alpha(n) = g_\beta(n) \exp\left( -\int_\alpha^\beta \frac{1}{p} E_p[N(A_n) \;|\; A_n] \, dp \right) \]

\[ \le g_\beta(n) \exp\left( -\int_\alpha^\beta E_p[N(A_n) \;|\; A_n] \, dp \right) \]

\[ \le \exp\left( -\int_\alpha^\beta E_p[N(A_n) \;|\; A_n] \, dp \right) . \]

One needs to show then that $E_p[N(A_n) \;|\; A_n]$ grows roughly linearly when
$p < p_c$. This is quite technical and we skip it. □

Definition. The number of open clusters per vertex is defined as

\[ \kappa(p) = E_p[|C|^{-1}] = \sum_{n=1}^{\infty} \frac{1}{n} P_p[|C| = n] . \]

Let $B_n$ be the box with side length $2n$ and center at the origin and let $K_n$ be
the number of open clusters in $B_n$. The following proposition explains the
name of $\kappa$.

Proposition 5.1.9. In $\mathcal{L}^1(\Omega, \mathcal{A}, P_p)$ we have

\[ K_n / |B_n| \to \kappa(p) . \]



Proof. Let $C_n(x)$ be the connected component of the open cluster in $B_n$
which contains $x \in \mathbb{Z}^d$. Define $\Gamma_n(x) = |C_n(x)|^{-1}$.

(i) $\sum_{x \in B_n} \Gamma_n(x) = K_n$.
Proof. If $\Sigma$ is an open cluster of $B_n$, then each vertex $x \in \Sigma$ contributes
$|\Sigma|^{-1}$ to the left hand side. Thus, each open cluster contributes 1 to the
left hand side.

(ii) $\frac{K_n}{|B_n|} \ge \frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x)$, where $\Gamma(x) = |C(x)|^{-1}$.
Proof. Follows from (i) and the trivial fact $\Gamma(x) \le \Gamma_n(x)$.

(iii) $\frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x) \to E_p[\Gamma(0)] = \kappa(p)$.
Proof. $\Gamma(x)$ are bounded random variables which have a distribution which
is invariant under the ergodic group of translations in $\mathbb{Z}^d$. The claim follows
from the ergodic theorem.

(iv) $\liminf_{n \to \infty} \frac{K_n}{|B_n|} \ge \kappa(p)$ almost everywhere.
Proof. Follows from (ii) and (iii).

(v) $\sum_{x \in B_n} \Gamma_n(x) \le \sum_{x \in B_n} \Gamma(x) + \sum_{x \sim \partial B_n} \Gamma_n(x)$, where $x \sim Y$ means
that $x$ is in the same cluster as one of the elements $y \in Y \subset \mathbb{Z}^d$.

(vi) $\frac{1}{|B_n|} \sum_{x \in B_n} \Gamma_n(x) \le \frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x) + \frac{|\partial B_n|}{|B_n|}$. □
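
The quantity $K_n/|B_n|$ is easy to simulate. The following is a minimal sketch for bond percolation on the two-dimensional box $\{-n, \dots, n\}^2$, using a union-find structure to count clusters; the function name, the restriction to $d = 2$ and the passing of a random generator `rng` are assumptions made for this illustration.

```python
import random

def clusters_per_vertex(n, p, rng):
    """K_n / |B_n| for bond percolation with edge probability p on {-n..n}^2."""
    side = 2 * n + 1
    parent = list(range(side * side))   # union-find forest over the vertices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    for x in range(side):
        for y in range(side):
            v = x * side + y
            # each nearest-neighbour edge inside the box is open with probability p
            if x + 1 < side and rng.random() < p:
                union(v, (x + 1) * side + y)
            if y + 1 < side and rng.random() < p:
                union(v, x * side + y + 1)

    k_n = sum(1 for v in range(side * side) if find(v) == v)   # number of clusters
    return k_n / (side * side)
```

A call such as `clusters_per_vertex(50, 0.3, random.Random(1))` gives one sample of $K_n/|B_n|$; isolated vertices count as clusters of size one, in agreement with the definition of $\kappa(p)$ as $E_p[|C|^{-1}]$.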



Remark. It is known that the function $\kappa(p)$ is continuously differentiable on
$[0, 1]$. It is even known that $\kappa$ and the mean size of the open cluster $\chi(p)$ are
real analytic functions on the interval $[0, p_c)$. There would be much more
to say in percolation theory. We mention:

The uniqueness of the infinite open cluster:
For $p > p_c$, and if $\theta(p_c) > 0$ also for $p = p_c$, there exists a unique infinite
open cluster.

Regularity of some functions:
For $p > p_c$, the functions $\theta(p), \chi^f(p), \kappa(p)$ are differentiable. In general,
$\theta(p)$ is continuous from the right.

The critical probability in two dimensions is 1/2.

5.2 Random Jacobi matrices



Definition. A Jacobi matrix with IID potential $V_\omega(n)$ is a bounded
self-adjoint operator on the Hilbert space

\[ l^2(\mathbb{Z}) = \{ (\dots, x_{-1}, x_0, x_1, x_2, \dots) \;|\; \sum_{k=-\infty}^{\infty} x_k^2 < \infty \} \]

of the form

\[ L_\omega u(n) = \sum_{|m-n|=1} u(m) + V_\omega(n) u(n) = (\Delta + V_\omega)(u)(n) , \]

where $V_\omega(n)$ are IID random variables in $\mathcal{L}^\infty$. These operators are called
discrete random Schrödinger operators. We are interested in properties of
$L$ which hold for almost all $\omega \in \Omega$. In this section, we mostly write the
elements $\omega$ of the probability space $(\Omega, \mathcal{A}, P)$ as a lower index.

Definition. A bounded linear operator $L$ has pure point spectrum, if there
exists a countable set of eigenvalues $\lambda_i$ with eigenfunctions $\phi_i$ such that
$L \phi_i = \lambda_i \phi_i$ and $\phi_i$ span the Hilbert space $l^2(\mathbb{Z})$. A random operator has
pure point spectrum if $L_\omega$ has pure point spectrum for almost all $\omega \in \Omega$.

Our goal is to prove the following theorem:



Theorem 5.2.1 (Fröhlich-Spencer). Let $V(n)$ be IID random variables with
uniform distribution on $[0,1]$. There exists $\lambda_0$ such that for $\lambda > \lambda_0$, the
operator $L_\omega = \Delta + \lambda \cdot V_\omega$ has pure point spectrum for almost all $\omega$.



We will give a recent elegant proof of Aizenman-Molchanov following [98].

Definition. Given $E \in \mathbb{C} \setminus \mathbb{R}$, define the Green function

\[ G_\omega(m, n, E) = [(L_\omega - E)^{-1}]_{mn} . \]

Let $\mu = \mu_\omega$ be the spectral measure of the vector $e_0$. This measure is
defined as the functional $C(\mathbb{R}) \to \mathbb{R}$, $f \mapsto f(L_\omega)_{00}$ by $f(L_\omega)_{00} = E[f(L)_{00}]$.
Define the function

\[ F(z) = \int_{\mathbb{R}} \frac{d\mu(y)}{y - z} . \]

It is a function on the complex plane and called the Borel transform of the
measure $\mu$. An important role will be played by its derivative

\[ F'(z) = \int_{\mathbb{R}} \frac{d\mu(y)}{(y - z)^2} . \]

Definition. Given any Jacobi matrix $L$, let $L_\alpha$ be the operator $L + \alpha P_0$,
where $P_0$ is the projection onto the one-dimensional space spanned by $\delta_0$.
One calls $L_\alpha$ a rank-one perturbation of $L$.



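Theorem 5.2.2 (Integral formula of Javrjan-Kotani). The average over all
spectral measures $d\mu_\alpha$ is the Lebesgue measure:

\[ \int_{\mathbb{R}} d\mu_\alpha \, d\alpha = dE . \]



Proof. The second resolvent formula gives

\[ (L_\alpha - z)^{-1} - (L - z)^{-1} = -\alpha (L_\alpha - z)^{-1} P_0 (L - z)^{-1} . \]

Looking at the $00$ entry of this matrix identity, we obtain

\[ F_\alpha(z) - F(z) = -\alpha F_\alpha(z) F(z) \]

which gives, when solved for $F_\alpha$, the Aronszajn-Krein formula

\[ F_\alpha(z) = \frac{F(z)}{1 + \alpha F(z)} . \]

We have to show that for any continuous function $f : \mathbb{C} \to \mathbb{C}$

\[ \int_{\mathbb{R}} \int_{\mathbb{R}} f(x) \, d\mu_\alpha(x) \, d\alpha = \int_{\mathbb{R}} f(x) \, dE(x) \]

and it is enough to verify this for the dense set of functions

\[ \{ f_z(x) = (x - z)^{-1} - (x + i)^{-1} \;|\; z \in \mathbb{C} \setminus \mathbb{R} \} . \]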

Contour integration in the upper half plane gives $\int_{\mathbb{R}} f_z(x) \, dx = 0$ for
$\mathrm{Im}(z) < 0$ and $2\pi i$ for $\mathrm{Im}(z) > 0$. On the other hand

\[ \int f_z(x) \, d\mu_\alpha(x) = F_\alpha(z) - F_\alpha(-i) \]

which is by the Aronszajn-Krein formula equal to

\[ h_z(\alpha) := \frac{1}{\alpha + F(z)^{-1}} - \frac{1}{\alpha + F(-i)^{-1}} . \]

Now, if $\pm \mathrm{Im}(z) > 0$, then $\pm \mathrm{Im} F(z) > 0$ so that $\pm \mathrm{Im} F(z)^{-1} < 0$. This
means that $h_z(\alpha)$ has either two poles in the lower half plane if $\mathrm{Im}(z) < 0$
or one in each half plane if $\mathrm{Im}(z) > 0$. Contour integration in the upper
half plane (now with $\alpha$) implies that $\int_{\mathbb{R}} h_z(\alpha) \, d\alpha = 0$ for $\mathrm{Im}(z) < 0$ and
$2\pi i$ for $\mathrm{Im}(z) > 0$. □
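
The Aronszajn-Krein formula used in this proof is purely algebraic and can be checked on a finite matrix. The sketch below uses a symmetric $2 \times 2$ matrix as a toy stand-in for the Jacobi matrix, an assumption made only for illustration; the entries `a`, `b`, `c`, the coupling `alpha` and the point `z` are arbitrary choices.

```python
def borel_00(a, b, c, z):
    """[(L - z)^{-1}]_{00} for the symmetric 2x2 matrix L = [[a, b], [b, c]]."""
    return (c - z) / ((a - z) * (c - z) - b * b)

a, b, c = 0.3, 1.0, -0.7     # hypothetical matrix entries
alpha = 0.8                  # strength of the rank-one perturbation alpha * P_0
z = 0.2 + 0.5j               # spectral parameter in the upper half plane

F = borel_00(a, b, c, z)
# L_alpha = L + alpha * P_0 only shifts the 00 entry of L by alpha
F_alpha = borel_00(a + alpha, b, c, z)

# Aronszajn-Krein: F_alpha = F / (1 + alpha * F)
residual = abs(F_alpha - F / (1 + alpha * F))
```

Here `residual` vanishes up to rounding, and `F.imag` is positive, in line with the fact that $\mathrm{Im}(z) > 0$ implies $\mathrm{Im} F(z) > 0$.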
In theorem (2.12.2), we have seen that any Borel measure $\mu$ on the real line
has a unique Lebesgue decomposition $d\mu = d\mu_{ac} + d\mu_{sing} = d\mu_{ac} + d\mu_{sc} +
d\mu_{pp}$. The function $F$ is related to this decomposition in the following way:



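Proposition 5.2.3. (Facts about Borel transform) For $\epsilon \to 0$, the measures
$\pi^{-1} \mathrm{Im} F(E + i\epsilon) \, dE$ converge weakly to $\mu$.
$d\mu_{sing}(\{ E \;|\; \mathrm{Im} F(E + i0) = \infty \}) = 1$,
$d\mu(\{E_0\}) = \lim_{\epsilon \to 0} \mathrm{Im} F(E_0 + i\epsilon) \epsilon$,
$d\mu_{ac}(E) = \pi^{-1} \mathrm{Im} F(E + i0) \, dE$.



Definition. Define for $\alpha \ne 0$ the sets

\[ S_\alpha = \{ x \in \mathbb{R} \;|\; F(x + i0) = -\alpha^{-1}, F'(x) = \infty \} \]
\[ P_\alpha = \{ x \in \mathbb{R} \;|\; F(x + i0) = -\alpha^{-1}, F'(x) < \infty \} \]
\[ L = \{ x \in \mathbb{R} \;|\; \mathrm{Im} F(x + i0) \ne 0 \} \]



Lemma 5.2.4. (Aronszajn-Donoghue) The set $P_\alpha$ is the set of eigenvalues of
$L_\alpha$. One has $(d\mu_\alpha)_{sc}(S_\alpha) = 1$ and $(d\mu_\alpha)_{ac}(L) = 1$. The sets $P_\alpha, S_\alpha, L$ are
mutually disjoint.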



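Proof. If $F(E + i0) = -1/\alpha$, then

\[ \lim_{\epsilon \to 0} \epsilon \, \mathrm{Im} F_\alpha(E + i\epsilon) = (\alpha^2 F'(E))^{-1} \]

since $F(E + i\epsilon) = -1/\alpha + i\epsilon F'(E) + o(\epsilon)$ if $F'(E) < \infty$, and
$\epsilon^{-1} \mathrm{Im}(1 + \alpha F) \to \infty$ if $F'(E) = \infty$, which means $\epsilon |1 + \alpha F|^{-1} \to 0$, and since
$F \to -1/\alpha$ one gets $\epsilon |F/(1 + \alpha F)| \to 0$.

The theorem of de la Vallée Poussin (see [92]) states that the set

\[ \{ E \;|\; |F_\alpha(E + i0)| = \infty \} \]

has full $(d\mu_\alpha)_{sing}$ measure. Because $F_\alpha = F/(1 + \alpha F)$, we know that
$|F_\alpha(E + i0)| = \infty$ is equivalent to $F(E + i0) = -1/\alpha$. □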



The following criterion of Simon-Wolff [100] will be important. In the case of
IID potentials with absolutely continuous distribution, a spectral averaging
argument will then lead to pure point spectrum also for $\alpha = 0$.

Theorem 5.2.5 (Simon-Wolff criterion). For any interval $[a, b] \subset \mathbb{R}$, the
random operator $L$ has pure point spectrum if

\[ F'(E) < \infty \]

for almost all $E \in [a, b]$.



Proof. By hypothesis, the Lebesgue measure of $S = \{ E \;|\; F'(E) = \infty \}$ is
zero. This means by the integral formula that $d\mu_\alpha(S) = 0$ for almost all $\alpha$.
The Aronszajn-Donoghue lemma (5.2.4) implies

\[ \mu_\alpha(S_\alpha \cap [a, b]) = \mu_\alpha(L \cap [a, b]) = 0 \]

so that $\mu_\alpha$ has only point spectrum. □



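Lemma 5.2.6. (Formula of Simon-Wolff) For each $E \in \mathbb{R}$, the sum
$\sum_{n \in \mathbb{Z}} |(L - E - i\epsilon)^{-1}_{0n}|^2$ increases monotonically as $\epsilon \searrow 0$ and converges
pointwise to $F'(E)$.



Proof. For $\epsilon > 0$, we have

\[ \sum_{n \in \mathbb{Z}} |(L - E - i\epsilon)^{-1}_{0n}|^2 = \|(L - E - i\epsilon)^{-1} \delta_0\|^2 \]

\[ = [(L - E - i\epsilon)^{-1} (L - E + i\epsilon)^{-1}]_{00} \]

\[ = \int_{\mathbb{R}} \frac{d\mu(x)}{(x - E)^2 + \epsilon^2} \]

from which the monotonicity and the limit follow. □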

Lemma 5.2.7. There exists a constant $C$ such that for all $\alpha, \beta \in \mathbb{C}$

\[ \int_0^1 |x - \alpha|^{1/2} |x - \beta|^{-1/2} \, dx \ge C \int_0^1 |x - \beta|^{-1/2} \, dx . \]



Proof. We can assume without loss of generality that $\alpha \in [0, 1]$, because
replacing a general $\alpha \in \mathbb{C}$ with the nearest point in $[0, 1]$ only decreases the
left hand side. Because the symmetry $\alpha \mapsto 1 - \alpha$ leaves the claim invariant,
we can also assume that $\alpha \in [0, 1/2]$. But then

\[ \int_0^1 |x - \alpha|^{1/2} |x - \beta|^{-1/2} \, dx \ge (1/4)^{1/2} \int_{3/4}^1 |x - \beta|^{-1/2} \, dx . \]

The function

\[ h(\beta) = \frac{\int_{3/4}^1 |x - \beta|^{-1/2} \, dx}{\int_0^1 |x - \beta|^{-1/2} \, dx} \]

is non-zero, continuous and satisfies $h(\infty) = 1/4$. Therefore

\[ C := \inf_{\beta \in \mathbb{C}} h(\beta) > 0 . \]
□

The next lemma is an estimate for the free Laplacian.



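Lemma 5.2.8. Let $f, g \in l^\infty(\mathbb{Z})$ be nonnegative and let $0 < a < (2d)^{-1}$. Then

\[ (1 - a\Delta) f \le g \Rightarrow f \le (1 - a\Delta)^{-1} g , \]
\[ [(1 - a\Delta)^{-1}]_{ij} \le (2da)^{|j - i|} (1 - 2da)^{-1} . \]



Proof. Since $\|\Delta\| \le 2d$, we can write $(1 - a\Delta)^{-1} = \sum_{m=0}^{\infty} (a\Delta)^m$, which is
preserving positivity. Since $[(a\Delta)^m]_{ij} = 0$ for $m < |i - j|$ we have

\[ [(1 - a\Delta)^{-1}]_{ij} = \sum_{m=|i-j|}^{\infty} [(a\Delta)^m]_{ij} \le \sum_{m=|i-j|}^{\infty} (2da)^m . \]

□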

We come now to the proof of theorem (5.2.1):



Proof. In order to prove theorem (5.2.1), we have by Simon-Wolff only to
show that $F'(E) < \infty$ for almost all $E$. This will be achieved by proving
$E[F'(E)^{1/4}] < \infty$. By the formula of Simon-Wolff, we have therefore to
show that

\[ \sup_{\mathrm{Im}(z) \ne 0} E[(\sum_n |G(n, 0, z)|^2)^{1/4}] < \infty . \]

Since

\[ (\sum_n |G(n, 0, z)|^2)^{1/4} \le \sum_n |G(n, 0, z)|^{1/2} \]

we have only to control the latter term. Define $g_z(n) = G(n, 0, z)$ and
$k_z(n) = E[|g_z(n)|^{1/2}]$. The aim is now to give an estimate for

\[ \sum_{n \in \mathbb{Z}} k_z(n) \]

which holds uniformly for $\mathrm{Im}(z) \ne 0$.
(i)

\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \le \delta_{n,0} + \sum_{|j|=1} k_z(n + j) . \]

Proof. $(L - z) g_z(n) = \delta_{n0}$ means

\[ (\lambda V(n) - z) g_z(n) = \delta_{n0} - \sum_{|j|=1} g_z(n + j) . \]

Jensen's inequality gives

\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \le \delta_{n0} + \sum_{|j|=1} k_z(n + j) . \]

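(ii)

\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \ge C \lambda^{1/2} k_z(n) . \]

Proof. We can write $g_z(n) = A/(\lambda V(n) + B)$, where $A, B$ are functions of
$\{V(l)\}_{l \ne n}$. The independent random variables $V(k)$ can be realized over
the probability space $\Omega = [0, 1]^{\mathbb{Z}} = \prod_{k \in \mathbb{Z}} \Omega(k)$. We average now
$|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}$ over $\Omega(n)$ and use an elementary integral estimate:

\[ \int_{\Omega(n)} \frac{|\lambda v - z|^{1/2} |A|^{1/2}}{|\lambda v + B|^{1/2}} \, dv = |A|^{1/2} \int_0^1 |v - z\lambda^{-1}|^{1/2} |v + B\lambda^{-1}|^{-1/2} \, dv \]

\[ \ge C |A|^{1/2} \int_0^1 |v + B\lambda^{-1}|^{-1/2} \, dv \]

\[ = C \lambda^{1/2} \int_0^1 |A/(\lambda v + B)|^{1/2} \, dv . \]

Taking the expectation over the remaining variables gives
$C \lambda^{1/2} E[|g_z(n)|^{1/2}] = C \lambda^{1/2} k_z(n)$ on the right hand side.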

(iii)

\[ k_z(n) \le (C\lambda^{1/2})^{-1} \left( \sum_{|j|=1} k_z(n + j) + \delta_{n0} \right) . \]

Proof. Follows directly from (i) and (ii).

(iv)

\[ (1 - (C\lambda^{1/2})^{-1} \Delta) k_z \le (C\lambda^{1/2})^{-1} \delta_{n0} . \]

Proof. Rewriting (iii).

(v) Define $\alpha = C\lambda^{1/2}$. Then

\[ k_z(n) \le \alpha^{-1} (2d/\alpha)^{|n|} (1 - 2d/\alpha)^{-1} . \]

Proof. For $\mathrm{Im}(z) \ne 0$, we have $k_z \in l^\infty(\mathbb{Z})$. From lemma (5.2.8) and (iv),
we have

\[ k_z(n) \le \alpha^{-1} [(1 - \Delta/\alpha)^{-1}]_{0n} \le \alpha^{-1} \left( \frac{2d}{\alpha} \right)^{|n|} \left( 1 - \frac{2d}{\alpha} \right)^{-1} . \]

(vi) For $\lambda > 4C^{-2}$, we get a uniform bound for $\sum_n k_z(n)$.
Proof. Since $(C\lambda^{1/2})^{-1} < 1/2$, we get the estimate from (v).

(vii) Pure point spectrum.

Proof. By Simon-Wolff, we have pure point spectrum for $L_\alpha$ for almost all
$\alpha$. Because the set of random operators of $L_\alpha$ and $L_0$ coincide on a set of
measure $\ge 1 - 2\alpha$, we get also pure point spectrum of $L_\omega$ for almost all
$\omega$. □



5.3 Estimation theory 

Estimation theory is a branch of mathematical statistics. The aim is to
estimate continuous or discrete parameters for models in an optimal way.
This leads to extremization problems. We start with some terminology.

Definition. A collection $(\Omega, \mathcal{A}, P_\theta)$ of probability spaces is called a
statistical model. If $X$ is a random variable, its expectation with respect to the
measure $P_\theta$ is denoted by $E_\theta[X]$, its variance is $\mathrm{Var}_\theta[X] = E_\theta[(X - E_\theta[X])^2]$.
If $X$ is continuous, then its probability density function is denoted by $f_\theta$.
In that case one has of course $E_\theta[X] = \int_\Omega x f_\theta(x) \, dx$. The parameters $\theta$ are
taken from a parameter space $\Theta$, which is assumed to be a subset of $\mathbb{R}$ or
$\mathbb{R}^k$.

Definition. A probability distribution $\mu = p(\theta) \, d\theta$ on $(\Theta, \mathcal{B})$ is called an
a priori distribution on $\Theta \subset \mathbb{R}$. It allows to define the global expectation

\[ E[X] = \int_\Theta E_\theta[X] \, d\mu(\theta) . \]

Definition. Given $n$ independent and identically distributed random
variables $X_1, \dots, X_n$ on the probability space $(\Omega, \mathcal{A}, P_\theta)$, we want to estimate
a quantity $g(\theta)$ using an estimator $T(\omega) = t(X_1(\omega), \dots, X_n(\omega))$.

Example. If the quantity $g(\theta) = E_\theta[X_i]$ is the expectation of the random
variables, we can look at the estimator $T(\omega) = \frac{1}{n} \sum_{j=1}^n X_j(\omega)$, the
arithmetic mean. The arithmetic mean is natural because for any data
$x_1, \dots, x_n$, the function $f(x) = \sum_{i=1}^n (x_i - x)^2$ is minimized by the
arithmetic mean of the data.

Example. We can also take the estimator $T(\omega)$ which is the median of
$X_1(\omega), \dots, X_n(\omega)$. The median is a natural quantity because the function
$f(x) = \sum_{i=1}^n |x_i - x|$ is minimized by the median. Proof. Order the data so
that $x_1 \le x_2 \le \dots \le x_n$. For $a \le b$ we have $|a - x| + |b - x| = |b - a| + C(x)$,
where $C(x)$ is zero if $a \le x \le b$, $C(x) = 2(x - b)$ if $x > b$ and $C(x) = 2(a - x)$ if
$x < a$. Pair $x_i$ with $x_{n+1-i}$ and write $C_i \ge 0$ for the function $C$ belonging to
the pair $a = x_i$, $b = x_{n+1-i}$. If $n = 2m + 1$ is odd, we have
$f(x) = \sum_{i=1}^m |x_i - x_{n+1-i}| + \sum_{i=1}^m C_i(x) + |x_{m+1} - x|$, which is minimized for
$x = x_{m+1}$, where all $C_i$ and the last term vanish. If $n = 2m$ is even, we have
$f(x) = \sum_{i=1}^m |x_i - x_{n+1-i}| + \sum_{i=1}^m C_i(x)$, which is minimized for
$x \in [x_m, x_{m+1}]$.
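
The two minimization properties can be seen on a small data set. This is a sketch with a hypothetical sample and a brute-force search over a grid of candidate values; the standard library functions `statistics.mean` and `statistics.median` are used only for comparison.

```python
from statistics import mean, median

def abs_loss(data, x):
    """f(x) = sum_i |x_i - x|, minimized by the median."""
    return sum(abs(xi - x) for xi in data)

def sq_loss(data, x):
    """f(x) = sum_i (x_i - x)^2, minimized by the arithmetic mean."""
    return sum((xi - x) ** 2 for xi in data)

data = [2.0, 7.0, 1.0, 9.0, 4.0]          # hypothetical sample with n = 5
grid = [i / 100 for i in range(0, 1001)]   # candidate values 0.00, 0.01, ..., 10.00

best_abs = min(grid, key=lambda x: abs_loss(data, x))   # grid minimizer of the absolute loss
best_sq = min(grid, key=lambda x: sq_loss(data, x))     # grid minimizer of the squared loss
```

On this sample `best_abs` coincides with `median(data)` and `best_sq` with `mean(data)`, because both values happen to lie on the grid.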

Definition. Define the bias of an estimator $T$ as

\[ B(\theta) = B_\theta[T] = E_\theta[T] - g(\theta) . \]

The bias is also called the systematic error. If the bias is zero, the estimator
is called unbiased. With an a priori distribution on $\Theta$, one can define the
global error $B(T) = \int_\Theta B(\theta) \, d\mu(\theta)$.



Proposition 5.3.1. A linear estimator $T(\omega) = \sum_{i=1}^n \alpha_i X_i(\omega)$ with $\sum_i \alpha_i = 1$
is unbiased for the quantity $g(\theta) = E_\theta[X_i]$.



Proof. $E_\theta[T] = \sum_{i=1}^n \alpha_i E_\theta[X_i] = E_\theta[X_i]$. □



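Proposition 5.3.2. For $g(\theta) = \mathrm{Var}_\theta[X_i]$ and fixed mean $m$, the estimator
$T = \frac{1}{n} \sum_{i=1}^n (X_i - m)^2$ is unbiased. If the mean is unknown, the estimator
$T = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$ with $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is unbiased.



Proof. a) $E_\theta[T] = \frac{1}{n} \sum_{i=1}^n E_\theta[(X_i - m)^2] = \mathrm{Var}_\theta[X_i] = g(\theta)$.
b) For $T = \frac{1}{n} \sum_i (X_i - \bar{X})^2$, we get

\[ E_\theta[T] = E_\theta[X_i^2] - E_\theta[\frac{1}{n^2} \sum_{i,j} X_i X_j] \]

\[ = E_\theta[X_i^2] - \frac{1}{n} E_\theta[X_i^2] - \frac{n(n-1)}{n^2} E_\theta[X_i]^2 \]

\[ = (1 - \frac{1}{n}) E_\theta[X_i^2] - \frac{n-1}{n} E_\theta[X_i]^2 \]

\[ = \frac{n-1}{n} \mathrm{Var}_\theta[X_i] . \]

Therefore $\frac{n}{n-1} T$ is the correct unbiased estimate. □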
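
The factor $(n-1)/n$ in part b) can be confirmed exactly on a small example. The sketch below enumerates all samples of size $n = 3$ from a hypothetical distribution (uniform on three values) and computes the expectations with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

values = [0, 1, 3]   # hypothetical distribution: uniform on these three values
n = 3                # sample size

def expect(stat):
    """Exact expectation of stat(sample) over all equally likely samples of size n."""
    samples = list(product(values, repeat=n))
    return sum(Fraction(stat(s)) for s in samples) / len(samples)

def biased(s):
    """(1/n) * sum_i (x_i - xbar)^2 with the sample mean xbar."""
    xbar = Fraction(sum(s), len(s))
    return sum((x - xbar) ** 2 for x in s) / len(s)

def unbiased(s):
    """n/(n-1) times the biased estimator."""
    return biased(s) * Fraction(len(s), len(s) - 1)

mu = Fraction(sum(values), len(values))
true_var = sum((v - mu) ** 2 for v in values) / len(values)   # variance of one X_i
```

Here `expect(biased)` equals `Fraction(n - 1, n) * true_var`, while `expect(unbiased)` equals `true_var`.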

Remark. Part b) is the reason why statisticians often take the average of
$\frac{n}{n-1} (x_i - \bar{x})^2$ as an estimate for the variance of $n$ data points $x_i$ with mean
$\bar{x}$ if the actual mean value $m$ is not known.

Definition. The expectation of the quadratic estimation error

\[ \mathrm{Err}_\theta[T] = E_\theta[(T - g(\theta))^2] \]

is called the risk function or the mean square error of the estimator $T$. It
measures the estimator performance. We have

\[ \mathrm{Err}_\theta[T] = \mathrm{Var}_\theta[T] + B_\theta[T]^2 , \]

where $B_\theta[T]$ is the bias.

Example. If $T$ is unbiased, then $\mathrm{Err}_\theta[T] = \mathrm{Var}_\theta[T]$.

Example. The arithmetic mean is the "best linear unbiased estimator".
Proof. With $T = \sum_i \alpha_i X_i$, where $\sum_i \alpha_i = 1$, the risk function is

\[ \mathrm{Err}_\theta[T] = \mathrm{Var}_\theta[T] = \sum_i \alpha_i^2 \mathrm{Var}_\theta[X_i] . \]

It is by Lagrange minimal for $\alpha_i = 1/n$.
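
A quick comparison of two weight vectors illustrates the last example. The common variance and the weights below are hypothetical values chosen for the illustration; the function only evaluates the formula $\sum_i \alpha_i^2 \mathrm{Var}_\theta[X_i]$ for independent $X_i$ with the same variance.

```python
from fractions import Fraction

def risk_of_linear_estimator(weights, variance):
    """Var[sum_i a_i X_i] = sum_i a_i^2 * Var[X_i] for independent X_i with common variance."""
    return sum(Fraction(a) ** 2 for a in weights) * variance

variance = Fraction(4)                       # hypothetical common variance of the X_i
uniform = [Fraction(1, 4)] * 4               # arithmetic mean with n = 4
skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]   # also sums to 1

risk_uniform = risk_of_linear_estimator(uniform, variance)
risk_skewed = risk_of_linear_estimator(skewed, variance)
```

Both weight vectors give unbiased estimators, but `risk_uniform` is `variance / 4` while `risk_skewed` is strictly larger.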

Definition. For continuous random variables, the maximum likelihood
function $t(x_1, \dots, x_n)$ is defined as the value of $\theta$ which maximizes
$\theta \mapsto L_\theta(x_1, \dots, x_n) := f_\theta(x_1) \cdots f_\theta(x_n)$. The maximum likelihood
estimator is the random variable

\[ T(\omega) = t(X_1(\omega), \dots, X_n(\omega)) . \]

For discrete random variables, $L_\theta(x_1, \dots, x_n)$ would be replaced by
$P_\theta[X_1 = x_1, \dots, X_n = x_n]$.

One also looks at the maximum a posteriori estimator, which is the maximum of

\[ \theta \mapsto L_\theta(x_1, \dots, x_n) = f_\theta(x_1) \cdots f_\theta(x_n) \, p(\theta) , \]

where $p(\theta) \, d\theta$ was the a priori distribution on $\Theta$.

Definition. The minimax principle is the aim to find

\[ \min_T \max_\theta R(\theta, T) . \]

The Bayes principle is the aim to find

\[ \min_T \int_\Theta R(\theta, T) \, d\mu(\theta) . \]

Example. Assume $f_\theta(x) = \frac{1}{2} e^{-|x - \theta|}$. The maximum likelihood function

\[ L_\theta(x_1, \dots, x_n) = \frac{1}{2^n} e^{-\sum_j |x_j - \theta|} \]

is maximal when $\sum_j |x_j - \theta|$ is minimal, which means that $t(x_1, \dots, x_n)$ is
the median of the data $x_1, \dots, x_n$.

Example. Assume $f_\theta(x) = \theta^x e^{-\theta}/x!$ is the probability density of the Poisson
distribution. The maximum likelihood function

\[ L_\theta(x_1, \dots, x_n) = \frac{e^{\sum_i \log(\theta) x_i - n\theta}}{x_1! \cdots x_n!} \]

is maximal for $\theta = \sum_{i=1}^n x_i / n$.

Example. The maximum likelihood estimator for $\theta = (m, \sigma^2)$ for Gaussian
distributed random variables $f_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}}$ has the maximum
likelihood function maximized for

\[ t(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_i x_i, \; \frac{1}{n} \sum_i (x_i - \bar{x})^2 \right) . \]
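
The Poisson example can be checked by maximizing the log-likelihood over a grid. This is a sketch with hypothetical count data; `math.lgamma(x + 1)` is used for $\log(x!)$.

```python
from math import lgamma, log

def poisson_log_likelihood(theta, data):
    """log L_theta(x_1..x_n) = sum_i (x_i * log(theta) - theta - log(x_i!))."""
    return sum(x * log(theta) - theta - lgamma(x + 1) for x in data)

data = [3, 0, 2, 5, 1, 4]                    # hypothetical observed counts
sample_mean = sum(data) / len(data)
grid = [k / 100 for k in range(1, 1001)]     # candidate parameters 0.01, ..., 10.00

# grid point with the largest log-likelihood
theta_hat = max(grid, key=lambda th: poisson_log_likelihood(th, data))
```

For this sample the grid maximizer `theta_hat` agrees with `sample_mean`, as the formula $\theta = \sum_i x_i / n$ predicts.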

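Definition. Define the Fisher information of a random variable $X$ with
density $f_\theta$ as

\[ I(\theta) = \int \left( \frac{f_\theta'(x)}{f_\theta(x)} \right)^2 f_\theta(x) \, dx . \]

If $\theta$ is a vector, one defines the Fisher information matrix

\[ I_{ij}(\theta) = \int \frac{\partial_{\theta_i} f_\theta \, \partial_{\theta_j} f_\theta}{f_\theta^2} f_\theta \, dx . \]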



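Lemma 5.3.3. $I(\theta) = \mathrm{Var}_\theta[\frac{f_\theta'}{f_\theta}]$.



Proof. $E_\theta[\frac{f_\theta'}{f_\theta}] = \int_\Omega f_\theta' \, dx = 0$ so that

\[ \mathrm{Var}_\theta[\frac{f_\theta'}{f_\theta}] = E_\theta[(\frac{f_\theta'}{f_\theta})^2] = I(\theta) . \]

□



Lemma 5.3.4. $I(\theta) = -E_\theta[(\log(f_\theta))'']$.



Proof. Integration by parts gives:

\[ E[\log(f_\theta)''] = \int \log(f_\theta)'' f_\theta \, dx = -\int \log(f_\theta)' f_\theta' \, dx = -\int (f_\theta'/f_\theta)^2 f_\theta \, dx . \]

□

Definition. The score function for a continuous random variable is defined
as the logarithmic derivative $\rho_\theta = f_\theta'/f_\theta$. One has $I(\theta) = E_\theta[\rho_\theta^2] = \mathrm{Var}_\theta[\rho_\theta]$.

Example. If $X$ is a Gaussian random variable, the score function
$\rho_\theta = f'(\theta)/f(\theta) = -(x - m)/\sigma^2$ is linear and has variance $1/\sigma^2$. The Fisher
information $I$ is $1/\sigma^2$. We see that $\mathrm{Var}[X] = 1/I$. This is a special case
$n = 1, T = X, \theta = m$ of the following bound: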



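Theorem 5.3.5 (Rao-Cramér inequality).

\[ \mathrm{Var}_\theta[T] \ge \frac{(1 + B'(\theta))^2}{n I(\theta)} . \]

In the unbiased case, one has

\[ \mathrm{Err}_\theta[T] \ge \frac{1}{n I(\theta)} . \]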



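Proof. 1) $\theta + B(\theta) = E_\theta[T] = \int t(x_1, \dots, x_n) L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n$.
2)

\[ 1 + B'(\theta) = \int t(x_1, \dots, x_n) L_\theta'(x_1, \dots, x_n) \, dx_1 \cdots dx_n \]

\[ = \int t(x_1, \dots, x_n) \frac{L_\theta'(x_1, \dots, x_n)}{L_\theta(x_1, \dots, x_n)} L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n \]

\[ = E_\theta[T \frac{L_\theta'}{L_\theta}] . \]

3) $1 = \int L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n$ implies

\[ 0 = \int L_\theta'(x_1, \dots, x_n) \, dx_1 \cdots dx_n = E_\theta[L_\theta'/L_\theta] . \]

4) Using 3) and 2)

\[ \mathrm{Cov}[T, L_\theta'/L_\theta] = E_\theta[T L_\theta'/L_\theta] - 0 = 1 + B'(\theta) . \]

5)

\[ (1 + B'(\theta))^2 = \mathrm{Cov}^2[T, \frac{L_\theta'}{L_\theta}] \le \mathrm{Var}_\theta[T] \, \mathrm{Var}_\theta[\frac{L_\theta'}{L_\theta}] = \mathrm{Var}_\theta[T] \, n I(\theta) , \]

where we used 4), the lemma and

\[ L_\theta'/L_\theta = \sum_{i=1}^n f_\theta'(x_i)/f_\theta(x_i) . \]

□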
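
For the Gaussian family with unknown mean the bound is attained by the arithmetic mean, which can be checked numerically. The sketch below approximates the Fisher information by a trapezoidal rule; the integration window of 12 standard deviations and the number of steps are arbitrary choices, and the score is taken with respect to the mean parameter $m$ (its square agrees with the square of the score in the Gaussian example above).

```python
from math import exp, pi, sqrt

def gaussian(x, m, sigma):
    """Density of the normal distribution with mean m and standard deviation sigma."""
    return exp(-(x - m) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

def score(x, m, sigma):
    """Derivative of log f with respect to the mean parameter m."""
    return (x - m) / sigma ** 2

def fisher_information(m, sigma, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of score^2 * f."""
    a, b = m - half_width * sigma, m + half_width * sigma
    h = (b - a) / steps
    total = 0.0
    for k in range(steps + 1):
        x = a + k * h
        weight = 0.5 if k in (0, steps) else 1.0   # endpoints get half weight
        total += weight * score(x, m, sigma) ** 2 * gaussian(x, m, sigma)
    return total * h

def cramer_rao_bound(m, sigma, n):
    """Lower bound 1 / (n * I) for the variance of an unbiased estimator of m."""
    return 1.0 / (n * fisher_information(m, sigma))
```

With `sigma = 1.5` the computed information is close to `1 / sigma**2`, and `cramer_rao_bound(m, sigma, n)` is close to `sigma**2 / n`, which is the variance of the arithmetic mean of `n` independent samples.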



Definition. Closely related to the Fisher information is the already defined
Shannon entropy of a random variable $X$:

\[ S(\theta) = -\int f_\theta \log(f_\theta) \, dx \]

as well as the power entropy

\[ N(\theta) = \frac{1}{2\pi e} e^{2 S(\theta)} . \]



Theorem 5.3.6 (Information Inequalities). If $X, Y$ are independent random
variables then the following inequalities hold:

a) Fisher information inequality: $I_{X+Y}^{-1} \ge I_X^{-1} + I_Y^{-1}$.

b) Power entropy inequality: $N_{X+Y} \ge N_X + N_Y$.

c) Uncertainty property: $I_X N_X \ge 1$.

In all cases, equality holds if and only if the random variables are Gaussian.



Proof. a) $I_{X+Y} \le c^2 I_X + (1 - c)^2 I_Y$ is proven using the Jensen inequality
(2.5.1). Take then $c = I_Y/(I_X + I_Y)$.

b) and c) are exercises. □



Theorem 5.3.7 (Rao-Cramér bound). A random variable $X$ with mean $m$
and variance $\sigma^2$ satisfies: $I_X \ge 1/\sigma^2$. Equality holds if and only if $X$ has the
normal distribution.



Proof. This is a special case of the Rao-Cramér inequality, where $\theta$ is fixed,
$n = 1$. The bias is automatically zero. A direct computation, giving also
uniqueness: $E[(aX + b)\rho(X)] = \int (ax + b) f'(x) \, dx = -a \int f(x) \, dx = -a$
implies

\[ 0 \le E[(\rho(X) + (X - m)/\sigma^2)^2] \]

\[ = E[\rho(X)^2] + 2 E[(X - m)\rho(X)]/\sigma^2 + E[(X - m)^2/\sigma^4] \]

\[ = I_X - 2/\sigma^2 + 1/\sigma^2 . \]

Equality holds if and only if $\rho_X$ is linear, that is if $X$ is normal. □

We see that the normal distribution has the smallest Fisher information
among all distributions with the same variance $\sigma^2$.

5.4 Vlasov dynamics 

Vlasov dynamics generalizes Hamiltonian n-body particle dynamics. It deals
with the evolution of the law $P^t$ of a discrete random vector $X^t$. If $P^t$ is
a discrete measure located on finitely many points, then it is the usual
dynamics of $n$ bodies which attract or repel each other. In general, the
stochastic process $X^t$ describes the evolution of densities or the evolution
of surfaces. It is an important feature of Vlasov theory that while the
random variables $X^t$ stay smooth, their laws $P^t$ can develop singularities. This
can be useful to model shocks. Due to the overlap of this section with
geometry and dynamics, the notation slightly changes in this section. We write
$X^t$ for the stochastic process for example and not $X_t$ as before.

Definition. Let $\Omega = M$ be a $2p$-dimensional Euclidean space or torus with
a probability measure $m$ and let $N$ be a Euclidean space of dimension $2q$.
Given a potential $V : \mathbb{R}^q \to \mathbb{R}$, the Vlasov flow $X^t = (f^t, g^t) : M \to N$ is
defined by the differential equation

\[ \dot{f} = g, \quad \dot{g} = -\int_M \nabla V(f(\omega) - f(\eta)) \, dm(\eta) . \]

These equations are called the Hamiltonian equations of the Vlasov flow.
We can interpret $X^t$ as a vector-valued stochastic process on the probability
space $(M, \mathcal{A}, m)$. The probability space $(M, \mathcal{A}, m)$ labels the particles which
move on the target space $N$.

Example. If $p = 0$ and $M$ is a finite set $\Omega = \{\omega_1, \dots, \omega_n\}$, then $X^t$ describes
the evolution of $n$ particles $(f_i, g_i) = X(\omega_i)$. Vlasov dynamics is therefore
a generalization of n-body dynamics. For example, if

\[ V(x_1, \dots, x_n) = \sum_i \frac{x_i^2}{2} , \]

then $\nabla V(x) = x$ and the Vlasov Hamiltonian system

\[ \dot{f} = g, \quad \dot{g}(\omega) = -\int_M (f(\omega) - f(\eta)) \, dm(\eta) \]

is equivalent to the n-body evolution

\[ \dot{f}_i = g_i \]
\[ \dot{g}_i = -\sum_{j=1}^n (f_i - f_j) . \]

In a center of mass coordinate system where $\sum_i f_i = 0$, this simplifies
to a system of coupled harmonic oscillators.

Example. If $N = M = \mathbb{R}^2$ and $m$ is a measure, then the process $X^t$
describes a volume-preserving deformation of the plane $M$. In other words,
$X^t$ is a one-parameter family of volume-preserving diffeomorphisms in the
plane.



Figure. An example with $M = N = \mathbb{R}^2$, where the measure $m$ is located on
2 points. The Vlasov evolution describes a deformation of the plane. The
situation is shown at time $t = 0$. The coordinates $(x, y)$ describe the position
and the speed of the particles.

Figure. The situation at time $t = 0.1$. The two particles have evolved in the
phase space $N$. Each point moves as a "test particle" in the force field of the
2 particles. Even though the 2-body problem is integrable, its periodic motion
acts like a "mixer" for the complicated evolution of the test particles.



Example. Let $M = N = \mathbb{R}^2$ and assume that the measure $m$ has its support
on a smooth closed curve $C$. The process $X^t$ is again a volume-preserving
deformation of the plane. It describes the evolution of a continuum of
particles on the curve. Dynamically, it can for example describe the evolution
of a curve where each part of the curve interacts with each other part. The
picture sequence below shows the evolution of a particle gas with support
on a closed curve in phase space. The interaction potential is $V(x) = e^{-x}$.
Because the curve at time $t$ is the image of the diffeomorphism $X^t$, it
will never have self intersections. The curvature of the curve is expected
to grow exponentially at many points. The deformation transformation
$X^t = (f^t, g^t)$ satisfies the differential equation

\[ \frac{d}{dt} f = g \]
\[ \frac{d}{dt} g = \int_M e^{-(f(\omega) - f(\eta))} \, dm(\eta) . \]

If $r(s)$, $s \in [0, 1]$ is the parameterization of the curve $C$ so that
$m(r[a, b]) = (b - a)$, then the equations are

\[ \frac{d}{dt} f^t(x) = g^t(x) \]
\[ \frac{d}{dt} g^t(x) = \int_0^1 e^{-(f^t(x) - f^t(r(s)))} \, ds . \]

The evolved curve $C^t$ at time $t$ is parameterized by $s \mapsto (f^t(r(s)), g^t(r(s)))$.




Figures. The support of the measure $P^t$ on $N = \mathbb{R}^2$ at three successive
times, starting with $P^0$.



Example. If $X^t$ is a stochastic process on $(\Omega = M, \mathcal{A}, m)$
which takes values in N, then $P^t$ is a probability measure on N defined by
$P^t[A] = m((X^t)^{-1} A)$. It is called the push-forward measure or law of
the random vector $X^t$. The measure $P^t$ is a measure in the phase space
N. The Vlasov evolution defines a family of probability spaces
$(N, \mathcal{B}, P^t)$. The spatial particle density $\rho$ is the law of
the random variable $x(x, y) = x$.



Example. Assume the measure $P^0$ is located on a curve
$r(s) = (s, \sin(s))$ and assume that there is no particle interaction at
all: V = 0. Then $P^t$ is supported on the curve
$(s + t\sin(s), \sin(s))$. While the spatial particle density has initially
a smooth density $\sqrt{1 + \cos(s)^2}$, it becomes discontinuous after some
time: once t > 1, the map $s \mapsto s + t\sin(s)$ is no longer monotone and
the curve folds over in the x-direction.
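The folding in the free evolution can be checked numerically. The following is a minimal sketch, assuming the free flow $(x, y) \mapsto (x + ty, y)$ acting on the curve $r(s) = (s, \sin(s))$; the function names and the sampling resolution are illustrative choices, not part of the text.

```python
import math

def free_evolution(s, t):
    """Image at time t of the point r(s) = (s, sin s) under (x, y) -> (x + t*y, y)."""
    return (s + t * math.sin(s), math.sin(s))

def position_is_monotone(t, samples=10_000):
    """Check whether s -> s + t*sin(s) is nondecreasing on [0, 2*pi]."""
    xs = [free_evolution(2 * math.pi * i / samples, t)[0]
          for i in range(samples + 1)]
    return all(b >= a for a, b in zip(xs, xs[1:]))

if __name__ == "__main__":
    for t in (0.0, 0.5, 2.0, 3.0):
        # For t > 1 several parameters s share the same position x,
        # so the projected (spatial) density develops singular folds.
        print(t, position_is_monotone(t))
```

As long as the position map is monotone, the spatial density is obtained by a smooth change of variables; when monotonicity fails, fold points appear where the projection to the x-axis is singular.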
Figure. Already for the free evolution of particles on a curve in phase
space, the spatial particle density can become non-smooth after some time.
The panels show t=0, t=2 and t=3.




Example. In the case of the quadratic potential $V(x) = x^2/2$ assume m has
a density $\rho(x, y) = e^{-x^2 - 2y^2}$, then $P^t$ has the density
$\rho_t(x, y) = \rho(x\cos(t) - y\sin(t), x\sin(t) + y\cos(t))$. To get
from this density in the phase space the spatial density of particles, we
have to integrate y out, which amounts to a conditional expectation.



Lemma 5.4.1. (Maxwell) If $X^t = (f^t, g^t)$ is a solution of the Vlasov
Hamiltonian flow, then the law $P^t = (X^t)^* m$ satisfies the Vlasov
equation

$$ \dot{P}^t(x, y) + y \cdot \nabla_x P^t(x, y)
   - W(x) \cdot \nabla_y P^t(x, y) = 0 $$

with $W(x) = \int_N \nabla_x V(x - x') \, P^t(x', y') \, dy' dx'$.



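Proof. We have $\int \nabla V(f(\omega) - f(\eta)) \, dm(\eta) = W(f(\omega))$.
Given a smooth function h on N of compact support, we calculate

$$ L = \int_N h(x, y) \frac{d}{dt} P^t(x, y) \, dxdy $$

as follows:

$$ L = \frac{d}{dt} \int_N h(x, y) P^t(x, y) \, dxdy $$
$$ = \frac{d}{dt} \int_M h(f(\omega, t), g(\omega, t)) \, dm(\omega) $$
$$ = \int_M \nabla_x h(f(\omega, t), g(\omega, t)) \, g(\omega, t) \, dm(\omega)
   - \int_M \nabla_y h(f(\omega, t), g(\omega, t))
     \int_M \nabla V(f(\omega) - f(\eta)) \, dm(\eta) \, dm(\omega) $$
$$ = \int_N \nabla_x h(x, y) \, y \, P^t(x, y) \, dxdy
   - \int_N P^t(x, y) \nabla_y h(x, y)
     \int_N \nabla V(x - x') P^t(x', y') \, dx'dy' \, dxdy $$
$$ = - \int_N h(x, y) \nabla_x P^t(x, y) \, y \, dxdy
   + \int_N h(x, y) W(x) \cdot \nabla_y P^t(x, y) \, dxdy . $$

□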



Remark. The Vlasov equation is an example of an integro-differential
equation. The right hand side is an integral. In a short hand notation, the
Vlasov equation is

$$ \dot{P} + y \cdot P_x - W(x) \cdot P_y = 0 , $$

where $W = \nabla_x V \star P$ is the convolution of the force
$\nabla_x V$ with P.

Example. V(x) = 0. Particles move freely. The Vlasov equation becomes the
transport equation $\dot{P}(x, y, t) + y \cdot \nabla_x P^t(x, y) = 0$,
which is in one dimension a partial differential equation
$u_t + y u_x = 0$. It has solutions $u(t, x, y) = u(0, x - ty, y)$.
Restricting this function to y = x gives the Burgers equation
$u_t + x u_x = 0$.

Example. For a quadratic potential $V(x) = x^2/2$, the Hamilton equations
are

$$ \ddot{f}(\omega) = -\Big( f(\omega) - \int_M f(\eta) \, dm(\eta) \Big) . $$

In center-of-mass coordinates $\tilde{f} = f - E[f]$, the system is a
decoupled system of a continuum of oscillators $\dot{f} = g$,
$\dot{g} = -f$ with solutions

$$ f(t) = f(0)\cos(t) + g(0)\sin(t), \qquad
   g(t) = -f(0)\sin(t) + g(0)\cos(t) . $$

The evolution for the density P is the partial differential equation

$$ \frac{d}{dt} P^t(x, y) + y \cdot \nabla_x P^t(x, y)
   - x \cdot \nabla_y P^t(x, y) = 0 , $$

written in short hand as $u_t + y \cdot u_x - x \cdot u_y = 0$, which has
the explicit solution
$P^t(x, y) = P^0(\cos(t)x - \sin(t)y, \sin(t)x + \cos(t)y)$, the initial
density transported along the inverse of the rotation flow. It is an
example of a Hamilton-Jacobi equation.
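A finite-difference check can make the explicit solution of $u_t + y u_x - x u_y = 0$ concrete. The sketch below assumes the initial density $e^{-x^2 - 2y^2}$ (unnormalized) and evaluates $P^t(x, y) = P^0(\cos(t)x - \sin(t)y, \sin(t)x + \cos(t)y)$, that is, the initial density composed with the inverse rotation. The step size h and the test point are arbitrary illustrative choices.

```python
import math

def p0(x, y):
    """Assumed initial phase space density (unnormalized)."""
    return math.exp(-x * x - 2.0 * y * y)

def p(t, x, y):
    """Density at time t: p0 composed with the inverse of the rotation flow."""
    c, s = math.cos(t), math.sin(t)
    return p0(c * x - s * y, s * x + c * y)

def residual(t, x, y, h=1e-5):
    """Central-difference approximation of u_t + y*u_x - x*u_y at (t, x, y)."""
    ut = (p(t + h, x, y) - p(t - h, x, y)) / (2 * h)
    ux = (p(t, x + h, y) - p(t, x - h, y)) / (2 * h)
    uy = (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)
    return ut + y * ux - x * uy

if __name__ == "__main__":
    # The residual should vanish up to discretization and rounding error.
    print(residual(0.7, 0.3, -0.4))
```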
Example. On any Riemannian manifold with Laplace-Beltrami operator
$\Delta$, there are natural potentials: the Poisson equation
$\Delta \phi = \rho$ is solved by $\phi = V \star \rho$, where $\star$ is
the convolution. This defines Newton potentials on the manifold. Here are
some examples:

• $N = \mathbb{R}$: $V(x) = \frac{|x|}{2}$.

• $N = \mathbb{T}$: $V(x) = \frac{|x(2\pi - x)|}{4\pi}$.

• $N = S^2$: $V(x) = \log(1 - x \cdot x)$.

• $N = \mathbb{R}^2$: $V(x) = \frac{1}{2\pi} \log|x|$.

• $N = \mathbb{R}^3$: $V(x) = \frac{1}{4\pi} \frac{1}{|x|}$.

• $N = \mathbb{R}^4$: $V(x) = \frac{1}{8\pi} \frac{1}{|x|^2}$.



For example, for $N = \mathbb{R}$, the Laplacian $\Delta f = f''$ is the
second derivative. It is diagonal in Fourier space:
$\widehat{\Delta f}(k) = -k^2 \hat{f}(k)$, where $k \in \mathbb{R}$. From
$\widehat{\Delta f}(k) = -k^2 \hat{f}(k) = \hat{\rho}(k)$ we get
$\hat{f}(k) = -(1/k^2)\hat{\rho}(k)$, so that $f = V \star \rho$, where V
is the function which has the Fourier transform $\hat{V}(k) = -1/k^2$. But
$V(x) = |x|/2$ has this Fourier transform:

$$ \int_{-\infty}^{\infty} \frac{|x|}{2} e^{-ikx} \, dx = -\frac{1}{k^2} . $$

Also for $N = \mathbb{T}$, the Laplacian $\Delta f = f''$ is diagonal in
Fourier space. It is the $2\pi$-periodic function
$V(x) = x(2\pi - x)/(4\pi)$, which has the Fourier series
$\hat{V}(k) = -1/k^2$.

For general $N = \mathbb{R}^n$, see for example [60].
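The Fourier coefficients of the circle potential can be verified numerically. This is a minimal sketch, assuming the unnormalized convention $\hat{V}(k) = \int_0^{2\pi} V(x) e^{-ikx} \, dx$ (with a different normalization the values change by a constant factor); the midpoint rule and the number of nodes are illustrative choices.

```python
import cmath
import math

def V(x):
    """Newton potential on the circle, for x in [0, 2*pi]."""
    return x * (2 * math.pi - x) / (4 * math.pi)

def fourier_coefficient(k, n=20_000):
    """Midpoint-rule approximation of the integral of V(x)*exp(-i*k*x) over [0, 2*pi]."""
    h = 2 * math.pi / n
    total = 0j
    for j in range(n):
        x = (j + 0.5) * h
        total += V(x) * cmath.exp(-1j * k * x)
    return total * h

if __name__ == "__main__":
    for k in (1, 2, 3):
        # Compare with the claimed value -1/k^2.
        print(k, fourier_coefficient(k), -1.0 / k**2)
```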

Remark. The function $G_y(x) = V(x - y)$ is also called the Green function
of the Laplacian. Because Newton potentials V are not smooth, establishing
global existence for the Vlasov dynamics is not easy but it has been done
in many cases [31]. The potential $|x|$ models galaxy motion and appears in
plasma dynamics [94, 67, 85].



Lemma 5.4.2. (Gronwall) If a function u satisfies $u'(t) \le |g(t)| u(t)$
for all $0 \le t \le T$, then
$u(t) \le u(0) \exp(\int_0^t |g(s)| \, ds)$ for $0 \le t \le T$.



Proof. Integrating the assumption gives
$u(t) \le u(0) + \int_0^t |g(s)| u(s) \, ds$. The function h(t) with
$h(0) = u(0)$ satisfying the differential equation $h'(t) = |g(t)| u(t)$
dominates u and therefore satisfies $h'(t) \le |g(t)| h(t)$. This leads to
$h(t) \le h(0) \exp(\int_0^t |g(s)| \, ds)$ so that
$u(t) \le u(0) \exp(\int_0^t |g(s)| \, ds)$. This proof for real valued
functions [20] generalizes to the case where $u^t(x)$ evolves in a function
space. One can just apply the same proof for any fixed x. □
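The Gronwall bound can be illustrated on a discretized example. The sketch below is not from the text: it assumes the hypothetical equation $u' = g(t)u - u^2$ with $g(t) = 2\cos(3t)$, which satisfies $u' \le |g(t)|u$ for $u \ge 0$, integrates it with the explicit Euler scheme, and compares each step with $u(0)\exp(\int_0^t |g|)$, where the integral is approximated by a left Riemann sum.

```python
import math

def g(t):
    # Hypothetical coefficient chosen for illustration; it changes sign.
    return 2.0 * math.cos(3.0 * t)

def gronwall_check(u0=1.0, T=5.0, n=50_000):
    """Euler scheme for u' = g(t)*u - u^2, tested step by step against the Gronwall bound."""
    h = T / n
    u, integral = u0, 0.0
    ok = True
    for j in range(n):
        t = j * h
        u = u + h * (g(t) * u - u * u)   # one Euler step
        integral += h * abs(g(t))        # left Riemann sum of |g|
        ok = ok and (0.0 <= u <= u0 * math.exp(integral))
    return ok

if __name__ == "__main__":
    print(gronwall_check())
```

For this scheme the bound holds exactly at the discrete level, because each step multiplies u by at most $1 + h|g(t)| \le e^{h|g(t)|}$.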
Theorem 5.4.3 (Batt-Neunzert-Brown-Hepp-Dobrushin). If $\nabla_x V$ is
bounded and globally Lipschitz continuous, then the Hamiltonian Vlasov
flow has a unique global solution $X^t$ and consequently, the Vlasov
equation has a unique and global solution $P^t$ in the space of measures.
If V and $P^0$ are smooth, then $P^t$ is piecewise smooth.



Proof. The Hamiltonian differential equation for X = (f, g) evolves on the
complete metric space of all continuous maps from M to N. The distance is
$d(X, Y) = \sup_{\omega \in M} d(X(\omega), Y(\omega))$, where d is the
distance in N.

We have to show that the differential equation $\dot{f} = g$ and
$\dot{g} = G(f) = -\int_M \nabla_x V(f(\omega) - f(\eta)) \, dm(\eta)$ in
C(M, N) has a unique solution: because of Lipschitz continuity

$$ \| G(f) - G(f') \|_\infty \le 2 \| D(\nabla_x V) \|_\infty
   \cdot \| f - f' \|_\infty $$

the standard Picard existence theorem for differential equations assures
local existence of solutions.

Gronwall's lemma assures that $\|X(\omega)\|$ can not grow faster than
exponentially. This gives the global existence. □

Remark. If m is a point measure supported on finitely many points, then 
one could also invoke the global existence theorem for differential equations. 
For smooth potentials, the dynamics depends continuously on the measure 
m. One could approximate a smooth measure m by point measures. 

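Definition. The evolution of $DX^t$ at a point $\omega \in M$ is called the
linearized Vlasov flow. It is the differential equation

$$ D\ddot{f}(\omega) = -\int_M \nabla^2 V(f(\omega) - f(\eta)) \, dm(\eta) \,
   Df(\omega) =: B(f^t) Df(\omega) $$

and we can write it as a first order differential equation

$$ \frac{d}{dt} DX = \frac{d}{dt}
   \begin{pmatrix} Df \\ Dg \end{pmatrix}
   = \begin{pmatrix} 0 & 1 \\
     -\int_M \nabla^2 V(f(\omega) - f(\eta)) \, dm(\eta) & 0 \end{pmatrix}
   \begin{pmatrix} Df \\ Dg \end{pmatrix}
   = A(f^t) \begin{pmatrix} Df \\ Dg \end{pmatrix} . $$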

Remark. The rank of the matrix $DX^t(\omega)$ stays constant.
$Df^t(\omega)$ is a linear combination of $Df^0(\omega)$ and
$Dg^0(\omega)$. Critical points of $f^t$ can only appear for $\omega$,
where $Df^0(\omega)$, $Dg^0(\omega)$ are linearly dependent. More generally
$Y_k(t) = \{ \omega \in M \mid DX^t(\omega)$ has rank
$2q - k = \dim(N) - k \}$ is time independent. The set $Y_q$ contains
$\{ \omega \mid D(f)(\omega) = \lambda D(g)(\omega),
\lambda \in \mathbb{R} \cup \{\infty\} \}$.
Definition. The random variable

$$ \lambda(\omega) = \limsup_{t \to \infty} \frac{1}{t}
   \log( \| D(X^t(\omega)) \| ) \in [0, \infty] $$

is called the maximal Lyapunov exponent of the $SL(2q, \mathbb{R})$-cocycle
$A^t = A(f^t)$ along an orbit $X^t = (f^t, g^t)$ of the Vlasov flow. The
Lyapunov exponent could be infinite. Differentiation of
$D\ddot{f} = B(f^t) Df^t$ at a critical point $\omega^t$ gives
$D^2 \ddot{f}^t(\omega^t) = B(f^t) D^2 f^t(\omega^t)$. The eigenvalues
$\lambda_j$ of the Hessian $D^2 f$ satisfy
$\ddot{\lambda}_j = B(f^t) \lambda_j$.

Definition. Time independent solutions of the Vlasov equation are called 
equilibrium measures or stationary solutions. 



Definition. One can construct some of them with a Maxwellian ansatz

$$ P(x, y) = C \exp\Big( -\beta \Big( \frac{y^2}{2}
   + \int V(x - x') Q(x') \, dx' \Big) \Big) = S(y) Q(x) . $$

The constant C is chosen such that $\int_{\mathbb{R}^d} S(y) \, dy = 1$.
These measures are called Bernstein-Green-Kruskal (BGK) modes.



Proposition 5.4.4. If $Q : N \to \mathbb{R}$ satisfies the integral
equation

$$ Q(x) = \exp\Big( -\int \beta V(x - x') Q(x') \, dx' \Big)
   = \exp( -\beta V \star Q(x) ) $$

then the Maxwellian distribution $P(x, y) = S(y)Q(x)$ is an equilibrium
solution of the Vlasov equation to the potential V.



Proof.

$$ y \nabla_x P = y S(y) Q_x(x)
   = y S(y) \Big( -\beta Q(x) \int \nabla_x V(x - x') Q(x') \, dx' \Big) $$

and

$$ \int_N \nabla_x V(x - x') \nabla_y P(x, y) P(x', y') \, dx' dy'
   = Q(x) ( -\beta S(y) y ) \int \nabla_x V(x - x') Q(x') \, dx' $$

gives
$y \nabla_x P(x, y) = \int_N \nabla_x V(x - x') \nabla_y P(x, y)
P(x', y') \, dx' dy'$. □
5.5 Multidimensional distributions



Random variables which are vector-valued can be treated in an analogous
way as real-valued random variables. One often adds the term
"multivariate" to indicate that one has multiple dimensions.

Definition. A random vector is a vector-valued random variable. It is in
$\mathcal{L}^p$ if each coordinate is in $\mathcal{L}^p$. The expectation
E[X] of a random vector $X = (X_1, \dots, X_d)$ is the vector
$(E[X_1], \dots, E[X_d])$, the variance is the vector
$(\mathrm{Var}[X_1], \dots, \mathrm{Var}[X_d])$.

Example. The random vector $X = (x^3, y^4, z^5)$ on the unit cube
$\Omega = [0, 1]^3$ with Lebesgue measure P has the expectation
E[X] = (1/4, 1/5, 1/6).

Definition. Assume $X = (X_1, \dots, X_d)$ is a random vector in
$\mathcal{L}^\infty$. The law of the random vector X is a measure $\mu$ on
$\mathbb{R}^d$ with compact support. After some scaling and translation we
can assume that $\mu$ is a bounded Borel measure on the unit cube
$I^d = [0, 1]^d$.



Definition. The multi-dimensional distribution function of a random vector
$X = (X_1, \dots, X_d)$ is defined as

$$ F_X(t) = F_{(X_1, \dots, X_d)}(t_1, \dots, t_d)
   = P[X_1 \le t_1, \dots, X_d \le t_d] . $$

For a continuous random variable, there is a density $f_X(t)$ satisfying

$$ F_X(t) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_d}
   f(s_1, \dots, s_d) \, ds_1 \cdots ds_d . $$

The multi-dimensional distribution function is also called multivariate
distribution function.



Definition. We use in this section the multi-index notation
$x^n = \prod_{i=1}^d x_i^{n_i}$. Denote by
$\mu_n = \int_{I^d} x^n \, d\mu$ the n'th moment of $\mu$. If X is a random
vector with law $\mu$, call $\mu_n(X)$ the n'th moment of X. It is equal to
$E[X^n] = E[X_1^{n_1} \cdots X_d^{n_d}]$. We call the map
$n \in \mathbb{N}^d \mapsto \mu_n$ the moment configuration or, if d = 1,
the moment sequence. We will tacitly assume $\mu_n = 0$, if at least one
coordinate $n_i$ in $n = (n_1, \dots, n_d)$ is negative.
If X is a continuous random vector, the moments satisfy

$$ \mu_n(X) = \int_{\mathbb{R}^d} x^n f(x) \, dx $$

which is a short hand notation for

$$ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}
   x_1^{n_1} \cdots x_d^{n_d} f(x_1, \dots, x_d) \, dx_1 \cdots dx_d . $$
Example. The n = (7, 3, 4)'th moment of the random vector
$X = (x^3, y^4, z^5)$ is

$$ E[X_1^{n_1} X_2^{n_2} X_3^{n_3}] = E[x^{21} y^{12} z^{20}]
   = \frac{1}{22} \cdot \frac{1}{13} \cdot \frac{1}{21} . $$

The random vector X is continuous and has the probability density

$$ f(x, y, z) = \Big( \frac{x^{-2/3}}{3} \Big)
   \Big( \frac{y^{-3/4}}{4} \Big) \Big( \frac{z^{-4/5}}{5} \Big) . $$
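The moments in this example can be computed exactly, since each coordinate of the cube is uniform on [0, 1] and $\int_0^1 x^m \, dx = 1/(m+1)$. The following sketch uses exact rational arithmetic; the helper name `moment` and its argument layout are illustrative, not from the text.

```python
from fractions import Fraction

def moment(powers, n):
    """E[X^n] for X = (x1^p1, ..., xd^pd) on the unit cube with Lebesgue measure.

    powers -- the exponents (p1, ..., pd) defining the random vector
    n      -- the multi-index (n1, ..., nd)
    """
    result = Fraction(1)
    for p, k in zip(powers, n):
        # The integral of x^(p*k) over [0, 1] is 1/(p*k + 1).
        result *= Fraction(1, p * k + 1)
    return result

if __name__ == "__main__":
    powers = (3, 4, 5)
    # Expectation vector: the moments for the unit multi-indices.
    print([moment(powers, e) for e in ((1, 0, 0), (0, 1, 0), (0, 0, 1))])
    # The (7, 3, 4)'th moment E[x^21 y^12 z^20].
    print(moment(powers, (7, 3, 4)))
```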

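Remark. As in one dimension, one can define a multidimensional moment
generating function

$$ M_X(t) = E[e^{t \cdot X}]
   = E[e^{t_1 X_1} e^{t_2 X_2} \cdots e^{t_d X_d}] $$

which contains all the information about the moments because of the
multi-dimensional moment formula

$$ E[X^n] = \int_{\mathbb{R}^d} x^n \, d\mu
   = \frac{d^n M_X}{dt^n}(t) \Big|_{t=0} , $$

where the n'th derivative is defined as

$$ \frac{d^n}{dt^n} f(t_1, \dots, t_d)
   = \frac{\partial^{n_1}}{\partial t_1^{n_1}}
     \frac{\partial^{n_2}}{\partial t_2^{n_2}} \cdots
     \frac{\partial^{n_d}}{\partial t_d^{n_d}} f(t_1, \dots, t_d) . $$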

Example. The random variable $X = (x, \sqrt{y}, z^{1/3})$ has the moment
generating function

$$ M(s, t, u) = \int_0^1 \int_0^1 \int_0^1
   e^{sx + t\sqrt{y} + u z^{1/3}} \, dxdydz
   = \frac{e^s - 1}{s} \cdot \frac{2 + 2e^t(t - 1)}{t^2}
     \cdot \frac{-6 + 3e^u(2 - 2u + u^2)}{u^3} . $$

Because the components $X_1, X_2, X_3$ in this example were independent
random variables, the moment generating function is of the form

$$ M(s) M(t) M(u) , $$

where the factors are the one-dimensional moment generating functions of
the one-dimensional random variables $X_1$, $X_2$ and $X_3$.

Definition. Let $e_i$ be the standard basis in $\mathbb{Z}^d$. Define the
partial difference $(\Delta_i a)_n = a_{n - e_i} - a_n$ on configurations
and write $\Delta^k = \prod_i \Delta_i^{k_i}$. Unlike the usual convention,
we take a particular sign convention for $\Delta$. This allows us to avoid
many negative signs in this section. By induction in $\sum_{i=1}^d n_i$,
one proves the relation

$$ (\Delta^k \mu)_n = \int_{I^d} x^{n-k} (1 - x)^k \, d\mu \qquad (5.1) $$

using $x^{n - e_i - k}(1 - x)^k - x^{n-k}(1 - x)^k
= x^{n - e_i - k}(1 - x)^{k + e_i}$. To improve readability, we also use
notation like $\frac{k}{n} = \prod_{i=1}^d \frac{k_i}{n_i}$ or
$\binom{n}{k} = \prod_{i=1}^d \binom{n_i}{k_i}$ or
$\sum_{k=0}^n = \sum_{k_1=0}^{n_1} \cdots \sum_{k_d=0}^{n_d}$. We mean
$n \to \infty$ in the sense that $n_i \to \infty$ for all $i = 1 \dots d$.






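Definition. Given a continuous function $f : I^d \to \mathbb{R}$. For
$n \in \mathbb{N}^d$, $n_i > 0$ we define the higher dimensional Bernstein
polynomials

$$ B_n(f)(x) = \sum_{k=0}^n f\Big( \frac{k_1}{n_1}, \dots,
   \frac{k_d}{n_d} \Big) \binom{n}{k} x^k (1 - x)^{n-k} . $$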



Lemma 5.5.1. (Multidimensional Bernstein) In the uniform topology in
$C(I^d)$, we have $B_n(f) \to f$ if $n \to \infty$.



Proof. By the Weierstrass theorem, multi-dimensional polynomials are dense
in $C(I^d)$ as they separate points in $I^d$. It is therefore enough to
prove the claim for $f(x) = x^m = \prod_{i=1}^d x_i^{m_i}$. Because
$B_n(y^m)(x)$ is the product of one dimensional Bernstein polynomials

$$ B_n(y^m)(x) = \prod_{i=1}^d B_{n_i}(y_i^{m_i})(x_i) , $$

the claim follows from the result corollary (2.6.2) in one dimension. □
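The factorization used in the proof can be seen in a direct evaluation of the two-dimensional Bernstein polynomial. The following is a minimal sketch for d = 2; the function name `bernstein` and the test functions are illustrative choices.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate B_n(f)(x) on the unit square for n = (n1, n2) and x = (x1, x2)."""
    (n1, n2), (x1, x2) = n, x
    total = 0.0
    for k1 in range(n1 + 1):
        w1 = comb(n1, k1) * x1**k1 * (1 - x1) ** (n1 - k1)
        for k2 in range(n2 + 1):
            w2 = comb(n2, k2) * x2**k2 * (1 - x2) ** (n2 - k2)
            # The weight of the grid point (k1/n1, k2/n2) is a product
            # of two one-dimensional binomial weights.
            total += f(k1 / n1, k2 / n2) * w1 * w2
    return total

if __name__ == "__main__":
    f = lambda a, b: a * a * b
    for n in (5, 20, 80):
        # The error at a fixed point shrinks like 1/n for this monomial.
        print(n, abs(bernstein(f, (n, n), (0.3, 0.7)) - 0.3**2 * 0.7))
```

For the monomial $y_1^2 y_2$ the one-dimensional identities $B_n(y^2)(x) = x^2 + x(1-x)/n$ and $B_n(y)(x) = x$ give the closed form $(x_1^2 + x_1(1 - x_1)/n_1) x_2$, which the code reproduces up to rounding.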



Remark. Hildebrandt and Schoenberg refer for the proof of lemma (5.5.1)
to Bernstein's proof in one dimension. While a higher dimensional
adaptation of the probabilistic proof could be done involving a stochastic
process in $\mathbb{Z}^d$ with drift $x_i$ in the i'th direction, the
factorization argument is more elegant.



Theorem 5.5.2 (Hausdorff, Hildebrandt-Schoenberg). There is a bijection
between signed bounded Borel measures $\mu$ on $[0, 1]^d$ and
configurations $\mu_n$ for which there exists a constant C such that

$$ \sum_{k=0}^n \Big| \binom{n}{k} (\Delta^k \mu)_n \Big| \le C,
   \quad \forall n \in \mathbb{N}^d . \qquad (5.2) $$

A configuration $\mu_n$ belongs to a positive measure if and only if
additionally to (5.2) one has $(\Delta^k \mu)_n \ge 0$ for all
$k, n \in \mathbb{N}^d$.



Proof. (i) Because by lemma (5.5.1), polynomials are dense in $C(I^d)$,
there exists a unique solution to the moment problem. We show now existence
of a measure $\mu$ under condition (5.2). For a measure $\mu$, define for
$n \in \mathbb{N}^d$ the atomic measures $\mu^{(n)}$ on $I^d$ which have
weights $\binom{n}{k} (\Delta^k \mu)_n$ on the $\prod_{i=1}^d (n_i + 1)$
points $\big( \frac{n_1 - k_1}{n_1}, \dots, \frac{n_d - k_d}{n_d} \big)
\in I^d$ with $0 \le k_i \le n_i$. Because

$$ \int_{I^d} x^m \, d\mu^{(n)}(x)
   = \sum_{k=0}^n \binom{n}{k} \Big( \frac{n-k}{n} \Big)^m
     (\Delta^k \mu)_n $$
$$ = \int_{I^d} \sum_{k=0}^n \binom{n}{k} \Big( \frac{n-k}{n} \Big)^m
     x^{n-k} (1 - x)^k \, d\mu(x) $$
$$ = \int_{I^d} B_n(y^m)(x) \, d\mu(x) \to \int_{I^d} x^m \, d\mu(x) , $$

we know that any signed measure $\mu$ which is an accumulation point of
$\mu^{(n)}$, where $n \to \infty$, solves the moment problem. The condition
(5.2) implies that the variation of the measures $\mu^{(n)}$ is bounded. By
Alaoglu's theorem, there exists an accumulation point $\mu$.

(ii) The left hand side of (5.2) is the variation $\|\mu^{(n)}\|$ of the
measure $\mu^{(n)}$. Because by (i) $\mu^{(n)} \to \mu$, and $\mu$ has
finite variation, there exists a constant C such that
$\|\mu^{(n)}\| \le C$ for all n. This establishes (5.2).

(iii) We see that if $(\Delta^k \mu)_n \ge 0$ for all k, then the measures
$\mu^{(n)}$ are all positive and therefore also the measure $\mu$.

(iv) If $\mu$ is a positive measure, then by (5.1)

$$ \binom{n}{k} (\Delta^k \mu)_n
   = \binom{n}{k} \int_{I^d} x^{n-k} (1 - x)^k \, d\mu(x) \ge 0 . $$

□

Remark. Hildebrandt and Schoenberg noted in 1933 that this result gives
a constructive proof of the Riesz representation theorem stating that the
dual of $C(I^d)$ is the space of Borel measures $M(I^d)$.

Definition. Let $\delta(x)$ denote the Dirac point measure located on
$x \in I^d$. It satisfies $\int_{I^d} y \, d\delta(x)(y) = x$.

We extract from the proof of theorem (5.5.2) the construction: 



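Corollary 5.5.3. An explicit finite constructive approximation of a given
measure $\mu$ on $I^d$ is given for $n \in \mathbb{N}^d$ by the atomic
measures

$$ \mu^{(n)} = \sum_{0 \le k_i \le n_i} \binom{n}{k} (\Delta^k \mu)_n \,
   \delta\Big( \frac{n_1 - k_1}{n_1}, \dots,
   \frac{n_d - k_d}{n_d} \Big) . $$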
Hausdorff established a criterion for absolute continuity of a measure
$\mu$ with respect to the Lebesgue measure on [0, 1] [74]. This can be
generalized to find a criterion for comparing two arbitrary measures and
works in d dimensions.

Definition. As usual, we call a measure $\mu$ on $I^d$ uniformly
absolutely continuous with respect to $\nu$, if it satisfies
$\mu = f \, d\nu$ with $f \in L^\infty(I^d)$.



Corollary 5.5.4. A positive probability measure $\mu$ is uniformly
absolutely continuous with respect to a second probability measure $\nu$ if
and only if there exists a constant C such that
$(\Delta^k \mu)_n \le C \cdot (\Delta^k \nu)_n$ for all
$k, n \in \mathbb{N}^d$.



Proof. If $\mu = f\nu$ with $f \in L^\infty(I^d)$, we get using (5.1)

$$ (\Delta^k \mu)_n = \int_{I^d} x^{n-k} (1 - x)^k \, d\mu(x) $$
$$ = \int_{I^d} x^{n-k} (1 - x)^k f \, d\nu(x) $$
$$ \le \| f \|_\infty \int_{I^d} x^{n-k} (1 - x)^k \, d\nu(x) $$
$$ = \| f \|_\infty (\Delta^k \nu)_n . $$

On the other hand, if $(\Delta^k \mu)_n \le C (\Delta^k \nu)_n$, then the
configuration $\rho_n = C\nu_n - \mu_n$ satisfies
$(\Delta^k \rho)_n = C(\Delta^k \nu)_n - (\Delta^k \mu)_n \ge 0$ and
defines by theorem (5.5.2) a positive measure $\rho$ on $I^d$. Since
$\rho = C\nu - \mu$, we have for any Borel set $A \subset I^d$ that
$\rho(A) \ge 0$. This gives $\mu(A) \le C\nu(A)$ and implies that $\mu$ is
absolutely continuous with respect to $\nu$ with a function f satisfying
$f(x) \le C$ almost everywhere. □

This leads to a higher dimensional generalization of Hausdorff's result
which allows one to characterize the continuity of a multidimensional
random vector from its moments:



Corollary 5.5.5. A Borel probability measure $\mu$ on $I^d$ is uniformly
absolutely continuous with respect to Lebesgue measure on $I^d$ if and only
if there exists a constant C such that

$$ | (\Delta^k \mu)_n | \le C \binom{n}{k}^{-1}
   \prod_{i=1}^d (n_i + 1)^{-1} $$

for all k and n.



Proof. Use corollary (5.5.4) and
$\int_{I^d} x^{n-k} (1 - x)^k \, dx = \binom{n}{k}^{-1}
\prod_{i=1}^d (n_i + 1)^{-1}$. □
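For the Lebesgue measure in dimension d = 1, the differences of the moment sequence $\mu_n = 1/(n+1)$ can be computed exactly, which illustrates both relation (5.1) and the weights of the atomic approximations. The sketch below assumes the sign convention $(\Delta a)_n = a_{n-1} - a_n$ of this section; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def lebesgue_moment(n):
    """n'th moment of Lebesgue measure on [0, 1]; zero for negative n by convention."""
    return Fraction(1, n + 1) if n >= 0 else Fraction(0)

def difference(mu, k, n):
    """(Delta^k mu)_n with the convention (Delta a)_n = a_{n-1} - a_n, for d = 1."""
    if k == 0:
        return mu(n)
    return difference(mu, k - 1, n - 1) - difference(mu, k - 1, n)

if __name__ == "__main__":
    n = 6
    # Each atom of the approximating measure mu^(n) gets the weight 1/(n+1).
    print([comb(n, k) * difference(lebesgue_moment, k, n) for k in range(n + 1)])
```

The weights are all equal to $1/(n+1)$, so the atomic measure $\mu^{(n)}$ is the uniform distribution on the $n + 1$ grid points, as one expects for an approximation of Lebesgue measure.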

There is also a characterization of Hausdorff of L p measures on I 1 = [0, 1] 
for p > 2. This has an obvious generalization to d dimensions: 
Proposition 5.5.6. Given a bounded positive probability measure
$\mu \in M(I^d)$ and assume $1 < p < \infty$. Then $\mu \in L^p(I^d)$ if
and only if there exists a constant C such that for all n

$$ (n + 1)^{p-1} \sum_{k=0}^n \Big| (\Delta^k \mu)_n
   \binom{n}{k} \Big|^p \le C . \qquad (5.3) $$



Proof. (i) Let $\mu^{(n)}$ be the measures of corollary (5.5.3). We
construct first from the atomic measures $\mu^{(n)}$ absolutely continuous
measures $\tilde{\mu}^{(n)} = g^{(n)} dx$ on $I^d$ given by a function
$g^{(n)}$ which takes the constant value

$$ | (\Delta^k \mu)_n | \binom{n}{k} \prod_{i=1}^d (n_i + 1) $$

on a cube of side lengths $1/(n_i + 1)$ centered at the point
$(n - k)/n \in I^d$. Because the cube has Lebesgue volume
$(n + 1)^{-1} = \prod_{i=1}^d (n_i + 1)^{-1}$, it has the same measure with
respect to both $\mu^{(n)}$ and $g^{(n)} dx$. We have therefore also
$g^{(n)} dx \to \mu$ weakly.

(ii) Assume $\mu = f dx$ with $f \in L^p$. Because
$g^{(n)} dx \to f dx$ in the weak topology for measures, we have
$g^{(n)} \to f$ weakly in $L^p$. But then, there exists a constant C such
that $\| g^{(n)} \|_p \le C$ and this is equivalent to (5.3).

(iii) On the other hand, assumption (5.3) means that
$\| g^{(n)} \|_p \le C$, where $g^{(n)}$ was constructed in (i). Since the
unit-ball in the reflexive Banach space $L^p(I^d)$ is weakly compact for
$p \in (1, \infty)$, a subsequence of $g^{(n)}$ converges to a function
$g \in L^p$. This implies that a subsequence of $g^{(n)} dx$ converges as
a measure to $g dx$ which is in $L^p$ and which is equal to $\mu$ by the
uniqueness of the moment problem (Weierstrass). □



5.6 Poisson processes 

Definition. A Poisson process $(S, P, \Pi, N)$ over a probability space
$(\Omega, \mathcal{F}, Q)$ is given by a complete metric space S, a
non-atomic finite Borel measure P on S and a function
$\omega \mapsto \Pi(\omega) \subset S$ from $\Omega$ to the set of finite
subsets of S such that for every measurable set $B \subset S$, the map

$$ \omega \mapsto N_B(\omega) = | \Pi(\omega) \cap B | $$

is a Poisson distributed random variable with parameter P[B]. For any
finite partition $\{B_i\}_{i=1}^n$ of S, the set of random variables
$\{N_{B_i}\}_{i=1}^n$ have to be independent. The measure P is called the
mean measure of the process. Here |A| denotes the cardinality of a finite
set A. It is understood that $N_B(\omega) = 0$ if
$\omega \in S^0 = \{0\}$.
Example. We have encountered the one-dimensional Poisson process in
the last chapter as a martingale. We started with IID exponentially
distributed random variables $X_k$ which are "waiting times" and defined
$N_t(\omega) = \sum_{k=1}^{\infty} 1_{S_k(\omega) \le t}$, where
$S_k = X_1 + \dots + X_k$. Let's translate this into the current
framework. The set S is [0, t] with Lebesgue measure P as mean measure. The
set $\Pi(\omega)$ is the discrete point set
$\Pi(\omega) = \{ S_n(\omega) \mid n = 1, 2, 3, \dots \} \cap S$. For every
Borel set B in S, we have

$$ N_B(\omega) = | \Pi(\omega) \cap B | . $$

Remark. The Poisson process is an example of a point process, because
we can see it as assigning a random point set $\Pi(\omega)$ on S which has
density P on S. If S is part of the Euclidean space and the mean measure P
is continuous $P = f dx$, then the interpretation is that f(x) is the
average density of points at x.



Figure. A Poisson process in $\mathbb{R}^2$ with mean density
$P = [\dots] \, dxdy / (2\pi)$.



Theorem 5.6.1 (Existence of Poisson processes). For every non-atomic mea- 
sure P on S, there exists a Poisson process. 



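Proof. Define $\Omega = \bigcup_{d=0}^{\infty} S^d$, where
$S^d = S \times \cdots \times S$ is the Cartesian product and
$S^0 = \{0\}$. Let $\mathcal{F}$ be the Borel $\sigma$-algebra on
$\Omega$. The probability measure Q restricted to $S^d$ is the normalized
product measure $(P \times P \times \cdots \times P)/P[S]^d$ multiplied by
$Q[N_S = d]$, where
$Q[N_S = d] = Q[S^d] = e^{-P[S]} (d!)^{-1} P[S]^d$. Define
$\Pi(\omega) = \{ \omega_1, \dots, \omega_d \}$ if $\omega \in S^d$ and
$N_B$ as above. One readily checks that $(S, P, \Pi, N)$ is a Poisson
process on the probability space $(\Omega, \mathcal{F}, Q)$: For any
measurable partition $\{B_j\}_{j=0}^m$ of S, we have

$$ Q\Big[ N_{B_0} = d_0, \dots, N_{B_m} = d_m \,\Big|\,
   N_S = d_0 + \sum_{j=1}^m d_j = d \Big]
   = \frac{d!}{d_0! \cdots d_m!} \prod_{j=0}^m
     \frac{P[B_j]^{d_j}}{P[S]^{d_j}} $$

so that the independence of $\{N_{B_j}\}_{j=1}^m$ follows:

$$ Q[N_{B_1} = d_1, \dots, N_{B_m} = d_m]
   = \sum_{d = d_1 + \cdots + d_m}^{\infty} Q[N_S = d] \,
     Q[N_{B_1} = d_1, \dots, N_{B_m} = d_m \mid N_S = d] $$
$$ = \sum_{d = d_1 + \cdots + d_m}^{\infty} \frac{e^{-P[S]}}{d!}
     \frac{d!}{d_0! \cdots d_m!} \prod_{j=0}^m P[B_j]^{d_j} $$
$$ = \Big[ \sum_{d_0=0}^{\infty}
     \frac{e^{-P[B_0]} P[B_0]^{d_0}}{d_0!} \Big]
     \prod_{j=1}^m \frac{e^{-P[B_j]} P[B_j]^{d_j}}{d_j!} $$
$$ = \prod_{j=1}^m \frac{e^{-P[B_j]} P[B_j]^{d_j}}{d_j!} $$
$$ = \prod_{j=1}^m Q[N_{B_j} = d_j] . $$

This calculation in the case m = 1, leaving away the last step, shows that
$N_B$ is Poisson distributed with parameter P[B]. The last step in the
calculation is then justified. □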
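The construction in this proof translates directly into a sampling procedure: first draw the total number of points from a Poisson distribution with parameter P[S], then place that many independent points according to the normalized mean measure. The sketch below assumes S = [0, 1] with mean measure P = lam times Lebesgue measure; the Poisson sampler uses Knuth's multiplication method, which is a swapped-in standard technique and adequate only for small lam.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for a Poisson(lam) variate (small lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_process(lam, rng):
    """Point set Pi(omega) of a Poisson process on [0, 1] with mean measure lam*dx."""
    d = sample_poisson(lam, rng)                 # omega lies in S^d
    return [rng.random() for _ in range(d)]      # d IID points with law P/P[S]

def count(points, a, b):
    """N_B for the interval B = [a, b)."""
    return sum(1 for z in points if a <= z < b)

if __name__ == "__main__":
    rng = random.Random(0)
    runs = 20_000
    mean = sum(count(sample_process(4.0, rng), 0.0, 0.25) for _ in range(runs)) / runs
    print(mean)  # should be close to P[B] = 4.0 * 0.25
```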

Remark. The random discrete measure $P(\omega)[B] = N_B(\omega)$ is a
normalized counting measure on S with support on $\Pi(\omega)$. The
expectation of the random measure $P(\omega)$ is the measure $\tilde{P}$ on
S defined by $\tilde{P}[B] = \int_\Omega P(\omega)[B] \, dQ(\omega)$. But
this measure is just P:



Lemma 5.6.2. $\tilde{P} = \int_\Omega P(\omega) \, dQ(\omega) = P$.



Proof. Because the Poisson distributed random variable
$N_B(\omega) = P(\omega)[B]$ has by assumption the Q-expectation
$P[B] = \sum_{k=0}^{\infty} k \, Q[N_B = k]
= \int_\Omega P(\omega)[B] \, dQ(\omega)$, one gets
$\tilde{P} = \int_\Omega P(\omega) \, dQ(\omega) = P$. □

Remark. The existence of Poisson processes can also be established by
assigning to a basis $\{e_i\}$ of the Hilbert space $L^2(S, P)$ some
independent Poisson-distributed random variables $Z_i = \phi(e_i)$ and
defining then a map $\phi(f) = \sum_i a_i \phi(e_i)$ if
$f = \sum_i a_i e_i$. The image of this map is a Hilbert space of random
variables with dot product $\mathrm{Cov}[\phi(f), \phi(g)] = (f, g)$.
Define $N_B = \phi(1_B)$. These random variables have the correct
distribution and are uncorrelated for disjoint sets $B_j$.

Definition. A point process is a map $\Pi$ from a probability space
$(\Omega, \mathcal{F}, Q)$ to the set of finite subsets of a probability
space $(S, \mathcal{B}, P)$ such that
$N_B(\omega) := | \Pi(\omega) \cap B |$ is a random variable for all
measurable sets $B \in \mathcal{B}$.
Definition. Assume $\Pi$ is a point process on $(S, \mathcal{B}, P)$. For
a function $f : S \to \mathbb{R}^+$ in $L^1(S, P)$, define the random
variable

$$ \Sigma_f(\omega) = \sum_{z \in \Pi(\omega)} f(z) . $$

Example. For a Poisson process and $f = 1_B$, one gets
$\Sigma_f(\omega) = N_B(\omega)$.

Definition. The moment generating function of $\Sigma_f$ is defined as for
any random variable as

$$ M_{\Sigma_f}(t) = E[e^{t \Sigma_f}] . $$

It is called the characteristic functional of the point process.



Example. For a Poisson process and $f = a 1_B$, the moment generating
function of $\Sigma_f(\omega) = a N_B(\omega)$ is
$E[e^{a t N_B}] = e^{P[B](e^{at} - 1)}$. We have computed the moment
generating function of a Poisson distributed random variable in the first
chapter.

Example. For a Poisson process and $f = \sum_{k=1}^n a_k 1_{B_k}$, where
$B_k$ are disjoint sets, we have the characteristic functional

$$ E[e^{t \Sigma_f}] = \prod_{k=1}^n e^{P[B_k](e^{a_k t} - 1)} . $$

Example. For a Poisson process, and $f \in L^1(S, P)$, the moment
generating function of $\Sigma_f$ is

$$ M_{\Sigma_f}(t) = \exp\Big( -\int_S (1 - \exp(t f(z))) \, dP(z) \Big) . $$

This is called Campbell's theorem. The proof is done by writing
$f = f^+ - f^-$, where both $f^+$ and $f^-$ are nonnegative, then
approximating both functions with step functions
$f_k^+ = \sum_j a_j^+ 1_{B_j^+} = \sum_j f_{kj}^+$ and
$f_k^- = \sum_j a_j^- 1_{B_j^-} = \sum_j f_{kj}^-$. Because for a Poisson
process, the random variables $\Sigma_{f_{kj}^\pm}$ are independent for
different j or different sign, the moment generating function of
$\Sigma_f$ is the product of the moment generating functions of the random
variables $\Sigma_{f_{kj}^\pm}$.
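Campbell's formula can be compared with a Monte Carlo estimate. The following is a rough sketch under stated assumptions: S = [0, 1], mean measure P = lam times Lebesgue measure, and f(z) = z, for which the integral in the exponent has the closed form $\lambda((e^t - 1)/t - 1)$. The Poisson sampler (Knuth's multiplication method) and all names are illustrative, and the agreement is only statistical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for a Poisson(lam) variate (small lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sigma_f(f, lam, rng):
    """One sample of Sigma_f, the sum of f(z) over the points z of the process."""
    d = sample_poisson(lam, rng)
    return sum(f(rng.random()) for _ in range(d))

def mgf_monte_carlo(f, lam, t, runs, rng):
    """Empirical mean of exp(t * Sigma_f)."""
    return sum(math.exp(t * sigma_f(f, lam, rng)) for _ in range(runs)) / runs

def mgf_campbell(lam, t):
    """exp(-int_0^1 (1 - exp(t*z)) * lam dz) for f(z) = z, in closed form."""
    return math.exp(lam * ((math.exp(t) - 1.0) / t - 1.0))

if __name__ == "__main__":
    rng = random.Random(1)
    print(mgf_monte_carlo(lambda z: z, 3.0, 0.5, 40_000, rng), mgf_campbell(3.0, 0.5))
```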




The next theorem of Alfred Renyi (1921-1970) gives a handy tool to check 
whether a point process, a random variable II with values in the set of 
finite subsets of S, defines a Poisson process. 



Definition. A k-cube in an open subset S of $\mathbb{R}^d$ is a set

$$ \prod_{i=1}^d \Big[ \frac{n_i}{2^k}, \frac{n_i + 1}{2^k} \Big) . $$
Theorem 5.6.3 (Rényi's theorem, 1967). Let P be a non-atomic probability
measure on $(S, \mathcal{B})$ and let $\Pi$ be a point process on
$(\Omega, \mathcal{F}, Q)$. Assume for any finite union of k-cubes
$B \subset S$, $Q[N_B = 0] = \exp(-P[B])$. Then $(S, P, \Pi, N)$ is a
Poisson process with mean measure P.



Proof. (i) Define
$O(B) = \{ \omega \in \Omega \mid N_B(\omega) = 0 \} \subset \Omega$ for
any measurable set B in S. By assumption, $Q[O(B)] = \exp(-P[B])$.

(ii) For m disjoint k-cubes $\{B_j\}_{j=1}^m$, the sets
$O(B_j) \subset \Omega$ are independent. Proof:

$$ Q\Big[ \bigcap_{j=1}^m O(B_j) \Big]
   = Q[ \{ N_{\bigcup_{j=1}^m B_j} = 0 \} ]
   = \exp\Big( -P\Big[ \bigcup_{j=1}^m B_j \Big] \Big)
   = \prod_{j=1}^m \exp(-P[B_j])
   = \prod_{j=1}^m Q[O(B_j)] . $$

(iii) We count the number of points in an open subset U of S using
k-cubes: define for k > 0 the random variable $N_U^k(\omega)$ as the
number of k-cubes B for which $\omega \notin O(B \cap U)$. These random
variables $N_U^k(\omega)$ converge to $N_U(\omega)$ for $k \to \infty$,
for almost all $\omega$.

(iv) For an open set U, the random variable $N_U$ is Poisson distributed
with parameter P[U]. Proof: we compute its moment generating function.
Because for different k-cubes, the sets $O(B_j)$ are independent, the
moment generating function of $N_U^k = \sum_j 1_{O(B_j)^c}$ is the product
of the moment generating functions of $1_{O(B_j)^c}$:

$$ E[e^{t N_U^k}] = \prod_{k\text{-cube } B}
   ( Q[O(B)] + e^t (1 - Q[O(B)]) )
   = \prod_{k\text{-cube } B}
   ( \exp(-P[B]) + e^t (1 - \exp(-P[B])) ) . $$

Each factor of this product is positive and the monotone convergence
theorem shows that the moment generating function of $N_U$ is

$$ E[e^{t N_U}] = \lim_{k \to \infty} \prod_{k\text{-cube } B}
   ( \exp(-P[B]) + e^t (1 - \exp(-P[B])) ) , $$

which converges to $\exp(P[U](e^t - 1))$ for $k \to \infty$ if the measure
P is non-atomic.

Because the generating function determines the distribution of $N_U$, this
assures that the random variables $N_U$ are Poisson distributed with
parameter P[U].

(v) For any disjoint open sets $U_1, \dots, U_m$, the random variables
$\{N_{U_j}\}_{j=1}^m$ are independent. Proof: the random variables
$\{N_{U_j}^k\}_{j=1}^m$ are independent for large enough k, because no
k-cube can be in more than one of the sets $U_j$. Letting $k \to \infty$
shows that the variables $N_{U_j}$ are independent.

(vi) To extend (iv) and (v) from open sets to arbitrary Borel sets, one
can use the characterization of a Poisson process by its moment generating
function of $f \in L^1(S, P)$. If $f = \sum_j a_j 1_{U_j}$ for disjoint
open sets $U_j$ and real numbers $a_j$, we have seen that the
characteristic functional is the characteristic functional of a Poisson
process. For general $f \in L^1(S, P)$ the characteristic functional is the
one of a Poisson process by approximation and the Lebesgue dominated
convergence theorem (2.4.3). Use $f = 1_B$ to verify that $N_B$ is Poisson
distributed and $f = \sum_j a_j 1_{B_j}$ with disjoint Borel sets $B_j$ to
see that $\{N_{B_j}\}_{j=1}^m$ are independent. □
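The limit in step (iv) can be checked numerically in the simplest case. The sketch assumes U = [0, 1) with P = lam times Lebesgue measure, so that there are $2^k$ dyadic k-cubes of measure $\lambda/2^k$ each and the product becomes a single power. The function names are illustrative.

```python
import math

def cube_product(lam, t, k):
    """Product over the 2^k dyadic k-cubes B of exp(-P[B]) + e^t*(1 - exp(-P[B]))."""
    p = lam / 2**k
    one_minus = -math.expm1(-p)            # 1 - exp(-p), computed accurately
    factor = (1.0 - one_minus) + math.exp(t) * one_minus
    return factor ** (2**k)

def poisson_mgf(lam, t):
    """Moment generating function exp(lam*(e^t - 1)) of a Poisson(lam) variable."""
    return math.exp(lam * math.expm1(t))

if __name__ == "__main__":
    for k in (2, 6, 12, 20):
        # The finite products approach the Poisson moment generating function.
        print(k, cube_product(2.0, 0.3, k), poisson_mgf(2.0, 0.3))
```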

5.7 Random maps 

Definition. Let $(\Omega, \mathcal{A}, P)$ be a probability space and M be
a manifold with Borel $\sigma$-algebra $\mathcal{B}$. A random
diffeomorphism on M is a measurable map from $M \times \Omega \to M$ so
that $x \mapsto f(x, \omega)$ is a diffeomorphism for all
$\omega \in \Omega$. Given a P-measure preserving transformation T on
$\Omega$, it defines a cocycle

$$ S(x, \omega) = (f(x, \omega), T(\omega)) $$

which is a map on $M \times \Omega$.

Example. If M is the circle and $f(x, c) = x + c\sin(x)$ is a circle
diffeomorphism, we can iterate this map and assume the parameter c is given
by IID random variables which change in each iteration. We can model this
by taking $(\Omega, \mathcal{A}, P) = ([0, 1]^{\mathbb{N}},
\mathcal{B}^{\mathbb{N}}, \nu^{\mathbb{N}})$ where $\nu$ is a measure on
[0, 1], taking the shift $T(x_n) = x_{n+1}$ and defining

$$ S(x, \omega) = (f(x, \omega_0), T(\omega)) . $$

Iterating this random circle map is done by taking IID random variables
$c_n$ with law $\nu$ and then iterating

$$ x_0, \; x_1 = f(x_0, c_0), \; x_2 = f(x_1, c_1), \; \dots $$
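The iteration can be sketched in a few lines. The code below assumes that the law $\nu$ of the parameters is the uniform distribution on [0, 1) and represents the circle as $[0, 2\pi)$; both are illustrative choices, as are the function names.

```python
import math
import random

TWO_PI = 2 * math.pi

def f(x, c):
    """Circle map x -> x + c*sin(x) on [0, 2*pi); a diffeomorphism for |c| < 1."""
    return (x + c * math.sin(x)) % TWO_PI

def random_orbit(x0, n, rng):
    """Orbit x_0, x_1 = f(x_0, c_0), ..., x_n with IID parameters c_j."""
    xs = [x0]
    for _ in range(n):
        c = rng.random()          # assumed law nu: uniform on [0, 1)
        xs.append(f(xs[-1], c))
    return xs

if __name__ == "__main__":
    rng = random.Random(42)
    print(random_orbit(1.0, 10, rng))
```

Drawing a fresh parameter in every step corresponds to reading off the coordinate $\omega_0$ and then applying the shift T to the sequence $\omega$.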
Example. Let $(\Omega, \mathcal{A}, P, T)$ be an ergodic dynamical system
and let $A : \Omega \to SL(d, \mathbb{R})$ be a measurable map with values
in the special linear group $SL(d, \mathbb{R})$ of all $d \times d$
matrices with determinant 1. With $M = \mathbb{R}^d$, the random
diffeomorphism $f(v, x) = A(x)v$, with $v \in M$ and $x \in \Omega$, is
called a matrix cocycle. One often uses the notation

$$ A^n(x) = A(T^{n-1}(x)) \cdot A(T^{n-2}(x)) \cdots A(T(x)) \cdot A(x) $$

for the n'th iterate of this random map.
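The order of the factors in $A^n(x)$ is easy to get wrong, so a small sketch may help. It assumes a hypothetical base system, the rotation $T(x) = x + \alpha \bmod 1$ with the golden mean as $\alpha$, and a hypothetical cocycle taking two integer shear matrices as values; integer entries keep the determinant exactly equal to 1.

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2   # assumed rotation number (golden mean)

def T(x):
    """Base transformation: rotation of the circle R/Z."""
    return (x + ALPHA) % 1.0

def A(x):
    """Hypothetical SL(2, Z)-valued map: one of two shear matrices."""
    return ((1, 1), (0, 1)) if x < 0.5 else ((1, 0), (1, 1))

def matmul(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def cocycle(x, n):
    """A^n(x) = A(T^{n-1} x) ... A(T x) A(x): new factors multiply from the left."""
    prod = ((1, 0), (0, 1))
    for _ in range(n):
        prod = matmul(A(x), prod)
        x = T(x)
    return prod

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

if __name__ == "__main__":
    m = cocycle(0.1, 30)
    print(m, det(m))
```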

Example. Let M be a finite set $\{1, \dots, n\}$ and let $P = P_{ij}$ be a
Markov transition matrix, a matrix with entries $P_{ij} \ge 0$ for which
the sum of the column elements is 1 in each column. A random map for which
$f(x_i, \omega) = x_j$ with probability $P_{ij}$ is called a finite Markov
chain.

Random diffeomorphisms are examples of Markov chains as covered in Sec- 
tion (3.14) of the chapter on discrete stochastic processes: 



Lemma 5.7.1. a) Any random map defines transition probability functions
$\mathcal{P} : M \times \mathcal{B} \to [0, 1]$:

$$ \mathcal{P}(x, B) = P[f(x, \omega) \in B] . $$

b) If $\mathcal{A}_n$ is a filtration of $\sigma$-algebras and
$X_n(\omega) = T^n(\omega)$ is $\mathcal{A}_n$ adapted, then $\mathcal{P}$
is a discrete Markov process.



Proof. a) We have to check that for all x, the measure
$\mathcal{P}(x, \cdot)$ is a probability measure on M. This is easily done
by checking all the axioms. We further have to verify that for all
$B \in \mathcal{B}$, the map $x \mapsto \mathcal{P}(x, B)$ is
$\mathcal{B}$-measurable. This is the case because f is a diffeomorphism
and so continuous and especially measurable.

b) is the definition of a discrete Markov process. □

Example. If $\Omega = (\Delta^{\mathbb{N}}, \mathcal{F}^{\mathbb{N}},
\nu^{\mathbb{N}})$ and T(x) is the shift, then the random map defines a
discrete Markov process.

Definition. In this case, we get IID $\Delta$-valued random variables
$X_n = T^n(x)_0$. A random map $f(x, \omega)$ defines so IID
diffeomorphism-valued random variables
$f_1(x)(\omega) = f(x, X_1(\omega))$, $f_2(x) = f(x, X_2(\omega))$. We will
call a random diffeomorphism in this case an IID random diffeomorphism. If
the transition probability measures are continuous, then the random
diffeomorphism is called a continuous IID random diffeomorphism. If
$f(x, \omega)$ depends smoothly on $\omega$ and the transition probability
measures are smooth, then the random diffeomorphism is called a smooth IID
random diffeomorphism. It is important to note that "continuous" and
"smooth" in this definition is only with respect to the transition
probabilities and that $\Delta$ must have at least dimension $d \ge 1$.
With respect to M, we have already assumed smoothness from the beginning.

Definition. A measure $\mu$ on M is called a stationary measure for the
random diffeomorphism if the measure $\mu \times P$ is invariant under the
map S.

Remark. If the random diffeomorphism defines a Markov process, the
stationary measure $\mu$ is a stationary measure of the Markov process.

Example. If every diffeomorphism $x \mapsto f(x, \omega)$ from
$\omega \in \Omega$ preserves a measure $\mu$, then $\mu$ is automatically
a stationary measure.

Example. Let $M = \mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$ denote the
two-dimensional torus. It is a group with addition modulo 1 in each
coordinate. Consider an IID random map in which each map either rotates the
point by the vector $\alpha = (\alpha_1, \alpha_2)$ or by the vector
$\beta = (\beta_1, \beta_2)$. The Lebesgue measure on $\mathbb{T}^2$ is
invariant because it is invariant for each of the two transformations. If
$\alpha$ and $\beta$ are both rational vectors, then there are infinitely
many ergodic invariant measures. For example, if $\alpha = (3/7, 2/7)$,
$\beta = (1/11, 5/11)$ then the 77 rectangles
$[i/7, (i + 1)/7] \times [j/11, (j + 1)/11]$ are permuted by both
transformations.

Definition. A stationary measure $\mu$ of a random diffeomorphism is called
ergodic, if $\mu \times P$ is an ergodic invariant measure for the map $S$
on $(M \times \Omega, \mu \times P)$.

Remark. If $\mu$ is a stationary invariant measure, one has

$$ \mu(A) = \int_M P(x,A) \, d\mu(x) $$

for every Borel set $A \subset M$. We have earlier written this as a fixed
point equation for the Markov operator $\mathcal{P}$ acting on measures:
$\mathcal{P}\mu = \mu$. In the context of random maps, the Markov operator
is also called a transfer operator.

Remark. Ergodicity in particular means that the transformation $T$ on the
"base probability space" $(\Omega, \mathcal{A}, P)$ is ergodic.

Definition. The support of a measure $\mu$ is the complement of the open set
of points $x$ for which there is a neighborhood $U$ with $\mu(U) = 0$. It is
by definition a closed set.

The previous example shows that there can be infinitely many ergodic
invariant measures of a random diffeomorphism. But for smooth IID random
diffeomorphisms, one has only finitely many, if the manifold is compact:

Theorem 5.7.2 (Finitely many ergodic stationary measures (Doob)). If $M$
is compact, a smooth IID random diffeomorphism has finitely many ergodic
stationary measures $\mu_i$. Their supports are mutually disjoint and
separated by open sets.



Proof. (i) Let $\mu_1$ and $\mu_2$ be two ergodic invariant measures. Denote
by $\Sigma_1$ and $\Sigma_2$ their supports. Assume $\Sigma_1$ and $\Sigma_2$
are not disjoint. Then there exist points $x_i \in \Sigma_i$ and open
neighborhoods $U_i$ of $x_i$ so that the transition probability
$P(x_1, U_2)$ is positive. This uses the assumption that the transition
probabilities have smooth densities. But then $\mu_2(U \times \Omega) = 0$
and $\mu_2(S(U \times \Omega)) > 0$, violating the measure preserving
property of $S$.

(ii) Assume there are infinitely many ergodic invariant measures. Then there
exist at least countably many. We can enumerate them as $\mu_1, \mu_2, \dots$.
Denote by $\Sigma_i$ their supports. Choose a point $y_i$ in $\Sigma_i$. The
sequence of points $y_i$ has an accumulation point $y \in M$ by compactness
of $M$. This implies that an arbitrary $\epsilon$-neighborhood $U$ of $y$
intersects infinitely many $\Sigma_i$. Again, the smoothness assumption on
the transition probabilities $P(y,\cdot)$ contradicts the $S$-invariance of
the measures $\mu_i$ having supports $\Sigma_i$.

□ 

Remark. If $\mu_1, \mu_2$ are stationary probability measures, then
$\lambda\mu_1 + (1-\lambda)\mu_2$ with $0 \leq \lambda \leq 1$ is another
stationary probability measure. This theorem implies that the set of
stationary probability measures forms a closed convex simplex with
finitely many corners. It is an example of a Choquet simplex.



5.8 Circular random variables 

Definition. A measurable function from a probability space
$(\Omega, \mathcal{A}, P)$ to the circle $(\mathbb{T}, \mathcal{B})$ with
Borel $\sigma$-algebra $\mathcal{B}$ is called a circle-valued random
variable. It is an example of a directional random variable. We can realize
the circle as $\mathbb{T} = [-\pi, \pi)$ or
$\mathbb{T} = [0, 2\pi) = \mathbb{R}/(2\pi\mathbb{Z})$.

Example. If $(\Omega, \mathcal{A}, P) = (\mathbb{R}, \mathcal{A},
e^{-x^2/2}/\sqrt{2\pi} \, dx)$, then $X(x) = x \mod 2\pi$ is a circle-valued
random variable. In general, for any real-valued random variable $Y$, the
random variable $X(x) = Y(x) \mod 2\pi$ is a circle-valued random variable.

Example. For a positive integer $k$, the first significant digit is
$X(k) = 2\pi \log_{10}(k) \mod 1$. It is a circle-valued random variable on
every finite probability space
$(\Omega = \{1, \dots, n\}, \mathcal{A}, P[\{k\}] = 1/n)$.

Example. A die takes values in $\{0, 1, 2, 3, 4, 5\}$ (count 6 as 0). We
roll it two times, but instead of adding up the results $X$ and $Y$, we add
them up modulo 6. For example, if $X = 4$ and $Y = 3$, then $X + Y = 1$.
Note that for fair dice $E[X + Y] = E[X] \neq E[X] + E[Y]$. Even if $X$ is
an unfair die, the sum $X + Y$ is a fair die as long as $Y$ is fair and
independent of $X$.
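The last claim can be checked exactly with rational arithmetic. The
following is a minimal sketch in Python; the particular unfair law and the
helper name `law_of_sum_mod` are illustrative assumptions, not part of the
text.

```python
from fractions import Fraction

def law_of_sum_mod(p, q):
    """Law of X + Y mod n for independent X ~ p and Y ~ q on {0, ..., n-1}."""
    n = len(p)
    law = [Fraction(0)] * n
    for a in range(n):
        for b in range(n):
            law[(a + b) % n] += p[a] * q[b]
    return law

# Hypothetical unfair die X and a fair die Y on {0, 1, 2, 3, 4, 5}.
unfair = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8),
          Fraction(1, 16), Fraction(1, 32), Fraction(1, 32)]
fair = [Fraction(1, 6)] * 6

sum_law = law_of_sum_mod(unfair, fair)
print(sum_law)
```

Every entry of `sum_law` equals 1/6, whatever law is chosen for the unfair
die, because convolving with the uniform law on the cyclic group averages
all weights.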

Definition. The law of a circular random variable $X$ is the push-forward
measure $\mu = X_* P$ on the circle $\mathbb{T}$. If the law is absolutely
continuous, it has a probability density function $f_X$ on the circle and
$\mu = f_X(x) \, dx$. As on the real line, the Lebesgue decomposition theorem
(2.12.2) assures that every measure on the circle can be decomposed as
$\mu = \mu_{pp} + \mu_{ac} + \mu_{sc}$, where $\mu_{pp}$ is (pp), $\mu_{sc}$
is (sc) and $\mu_{ac}$ is (ac).

Example. The law of the random variable in the first example is a measure
on the circle with a smooth density

$$ f_X(x) = \sum_{k=-\infty}^{\infty} e^{-(x+2\pi k)^2/2}/\sqrt{2\pi} . $$

It is an example of a wrapped normal distribution.

Example. The law of the first significant digit random variable
$X_n(k) = 2\pi \log_{10}(k) \mod 1$ defined on $\{1, \dots, n\}$ is a
discrete measure, supported on $\{ k 2\pi/10 \mid 0 < k < 10 \}$. It is an
example of a lattice distribution.

Definition. The entropy of a circle-valued random variable $X$ with
probability density function $f_X$ is defined as
$H(f) = -\int_0^{2\pi} f(x) \log(f(x)) \, dx$.
The relative entropy for two densities is defined as

$$ H(f|g) = \int_0^{2\pi} f(x) \log(f(x)/g(x)) \, dx . $$

The Gibbs inequality lemma (2.15.1) assures that $H(f|g) \geq 0$ and that
$H(f|g) = 0$ if $f = g$ almost everywhere.

Definition. The mean direction $m$ and resultant length $\rho$ of a circular
random variable taking values in $\{|z| = 1\} \subset \mathbb{C}$ are
defined as

$$ \rho e^{im} = E[e^{iX}] . $$

One can write $\rho = E[\cos(X - m)]$. The circular variance is defined as
$V = 1 - \rho = E[1 - \cos(X - m)] = E[(X-m)^2/2 - (X-m)^4/4! + \dots]$.
The latter expansion shows the relation with the variance in the case of
real-valued random variables. The circular variance is a number in $[0, 1]$.
If $\rho = 0$, there is no distinguished mean direction. We define $m = 0$
just to have one in that case.

Example. If the distribution of $X$ is located at a single point $x_0$, then
$\rho = 1$, $m = x_0$ and $V = 0$. If the distribution of $X$ is the uniform
distribution on the circle, then $\rho = 0$, $V = 1$. There is no particular
mean direction in this case. For the wrapped normal distribution $m = 0$,
$\rho = e^{-\sigma^2/2}$, $V = 1 - e^{-\sigma^2/2}$.
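The quantities $m$, $\rho$ and $V$ are straightforward to compute for a
discrete law. A minimal sketch in Python, assuming a finite list of angles
with optional weights; the function name `circular_summary` and the cutoff
`1e-12` for "no distinguished direction" are illustrative choices, not
taken from the text.

```python
import cmath
import math

def circular_summary(angles, weights=None):
    """Return (m, rho, V) with rho * exp(i m) = E[exp(i X)] and V = 1 - rho."""
    if weights is None:
        weights = [1.0 / len(angles)] * len(angles)
    z = sum(w * cmath.exp(1j * x) for x, w in zip(angles, weights))
    rho = abs(z)
    # Convention from the text: m = 0 when there is no distinguished direction.
    m = cmath.phase(z) if rho > 1e-12 else 0.0
    return m, rho, 1.0 - rho

# Point mass at x0 = 1.0: rho = 1, m = x0, V = 0.
m_point, rho_point, v_point = circular_summary([1.0])

# Uniform law on 360 equally spaced angles: rho = 0, V = 1.
grid = [2 * math.pi * j / 360 for j in range(360)]
m_unif, rho_unif, v_unif = circular_summary(grid)
```

The equally spaced grid is only a discrete stand-in for the uniform
distribution on the circle; it reproduces $\rho = 0$ and $V = 1$ up to
rounding.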

The following theorem is analogous to theorem (2.5.5):



Theorem 5.8.1 (Chebychev inequality on the circle). If $X$ is a circular
random variable with circular mean $m$ and variance $V$, then

$$ P[ |\sin((X-m)/2)| \geq \epsilon ] \leq \frac{V}{2\epsilon^2} . $$



Proof. We can assume without loss of generality that $m = 0$, otherwise
replace $X$ with $X - m$, which does not change the variance. We take
$\mathbb{T} = [-\pi, \pi)$. We use the trigonometric identity
$1 - \cos(x) = 2\sin^2(x/2)$ to get

$$ V = E[1 - \cos(X)] = 2 E[\sin^2(X/2)]
     \geq 2 E[ 1_{\{|\sin(X/2)| \geq \epsilon\}} \sin^2(X/2) ]
     \geq 2 \epsilon^2 P[ |\sin(X/2)| \geq \epsilon ] . $$

□ 

Example. Let $X$ be the random variable which has a discrete distribution
with a law supported on the points $x = x_0 = 0$ and
$x = x_\pm = \pm 2\arcsin(\epsilon)$, with $P[X = x_0] = 1 - V/(2\epsilon^2)$
and $P[X = x_\pm] = V/(4\epsilon^2)$. This distribution has the circular mean
$m = 0$ and the variance $V$. The equality

$$ P[ |\sin(X/2)| \geq \epsilon ] = 2V/(4\epsilon^2) = V/(2\epsilon^2) $$

shows that the Chebychev inequality on the circle is "sharp": one can not
improve it without further assumptions on the distribution.
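The three-point law of this example can be verified numerically. The sketch
below is in Python; the values $V = 0.1$ and $\epsilon = 0.5$ and the helper
name `sharp_example` are illustrative assumptions. Any pair with
$V \leq 2\epsilon^2$ gives a valid probability vector.

```python
import cmath
import math

def sharp_example(V, eps):
    """Three-point law from the example: atoms at 0 and +-2*arcsin(eps)."""
    x = 2 * math.asin(eps)
    atoms = [0.0, x, -x]
    probs = [1 - V / (2 * eps**2), V / (4 * eps**2), V / (4 * eps**2)]
    return atoms, probs

V, eps = 0.1, 0.5          # illustrative values with V <= 2 * eps**2
atoms, probs = sharp_example(V, eps)

# Resultant vector E[exp(iX)]; its modulus is rho and its phase is m.
z = sum(p * cmath.exp(1j * a) for a, p in zip(atoms, probs))
circular_variance = 1 - abs(z)

# Left and right hand sides of the Chebychev inequality on the circle.
tail = sum(p for a, p in zip(atoms, probs)
           if abs(math.sin(a / 2)) >= eps - 1e-12)
bound = V / (2 * eps**2)
```

Here `circular_variance` reproduces $V$, the mean direction is 0, and `tail`
equals `bound`, so the inequality holds with equality for this law.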

Definition. A sequence of circle-valued random variables $X_n$ converges
weakly to a circle-valued random variable $X$ if the law of $X_n$ converges
weakly to the law of $X$. As with real-valued random variables, weak
convergence is also called convergence in law.

Example. The sequence $X_n$ of significant digit random variables converges
weakly to a random variable with lattice distribution
$P[X = k] = \log_{10}(k+1) - \log_{10}(k)$ supported on
$\{ k 2\pi/10 \mid 0 < k < 10 \}$. It is called the distribution of the
first significant digit. The interpretation is that if you take a large
random number, then the probability that the first digit is 1 is
$\log_{10}(2)$, the probability that the first digit is 6 is
$\log_{10}(7/6)$. The law is also called Benford's law.
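The nine probabilities of this law can be tabulated directly. A minimal
Python sketch, assuming base-10 logarithms as in the formula above; the
names `benford_probability` and `benford_law` are illustrative.

```python
import math

def benford_probability(digit):
    """P[first digit = k] = log10(k + 1) - log10(k) for k = 1, ..., 9."""
    return math.log10(digit + 1) - math.log10(digit)

benford_law = {k: benford_probability(k) for k in range(1, 10)}

for k, p in benford_law.items():
    print(k, round(p, 4))
```

The sum telescopes to $\log_{10}(10) - \log_{10}(1) = 1$, so the weights
form a probability distribution on the digits 1 to 9.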
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Definition. The characteristic function of a circle- valued random variable 
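$X$ is the Fourier transform $\phi_X = \hat{\mu}$ of the law $\mu$ of $X$.
It is a sequence (that is a function on $\mathbb{Z}$) given by

$$ \phi_X(k) = E[e^{ikX}] = \int_{\mathbb{T}} e^{ikx} \, d\mu(x) . $$

Definition. More generally, the characteristic function of a
$\mathbb{T}^d$-valued random variable (circle-valued random vector) is the
Fourier transform of the law of $X$. It is a function on $\mathbb{Z}^d$
given by

$$ \phi_X(k) = E[e^{ik \cdot X}] = \int_{\mathbb{T}^d} e^{ik \cdot x} \, d\mu(x) . $$
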

The following lemma is analogous to corollary (2.17).



Lemma 5.8.2. A sequence $X_n$ of circle-valued random variables converges
in law to a circle-valued random variable $X$ if and only if for every
integer $k$, one has $\phi_{X_n}(k) \to \phi_X(k)$ for $n \to \infty$.



Example. A circle-valued random variable with probability density function
$f(x) = C e^{\kappa \cos(x - \alpha)}$ is said to have the Mises
distribution. It is also called the circular normal distribution. The
constant $C$ is $1/(2\pi I_0(\kappa))$, where
$I_0(\kappa) = \sum_{n=0}^{\infty} (\kappa/2)^{2n}/(n!)^2$ is a modified
Bessel function. The parameter $\kappa$ is called the concentration
parameter, the parameter $\alpha$ is called the mean direction. For
$\kappa \to 0$, the Mises distribution approaches the uniform distribution
on the circle.
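The normalization constant can be checked numerically from the series for
$I_0$. The following Python sketch truncates the series after 60 terms and
integrates with a plain Riemann sum; both choices, as well as the function
names, are assumptions made for illustration.

```python
import math

def bessel_i0(kappa, terms=60):
    """Series I_0(kappa) = sum_n (kappa/2)^(2n) / (n!)^2, truncated."""
    return sum((kappa / 2) ** (2 * n) / math.factorial(n) ** 2
               for n in range(terms))

def mises_density(x, kappa, alpha=0.0):
    """Density C * exp(kappa * cos(x - alpha)) with C = 1 / (2 pi I_0(kappa))."""
    return math.exp(kappa * math.cos(x - alpha)) / (2 * math.pi * bessel_i0(kappa))

def integrate_circle(f, steps=2000):
    """Riemann sum over [-pi, pi); accurate for smooth periodic integrands."""
    h = 2 * math.pi / steps
    return h * sum(f(-math.pi + j * h) for j in range(steps))

kappa = 2.0
total_mass = integrate_circle(lambda x: mises_density(x, kappa))
rho = integrate_circle(lambda x: math.cos(x) * mises_density(x, kappa))
```

With $\alpha = 0$, `total_mass` is 1 up to rounding and `rho` is the
resultant length $E[\cos(X)]$ of the Mises distribution for the chosen
$\kappa$.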






Figure. The density function of
the Mises distribution on $[-\pi, \pi]$.



Figure. The density function of 
the Mises distribution plotted as a 
polar graph. 
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Proposition 5.8.3. The Mises distribution maximizes the entropy among all 
circular distributions with fixed mean $\alpha$ and circular variance $V$.



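Proof. If $g$ is the density of the Mises distribution, then
$\log(g) = \kappa \cos(x - \alpha) + \log(C)$ and
$H(g) = -\kappa\rho - \log(C)$, where $\rho$ is the resultant length.
Now compute the relative entropy

$$ 0 \leq H(f|g) = \int f(x) \log(f(x)) \, dx - \int f(x) \log(g(x)) \, dx . $$

Because $f$ and $g$ have the same mean direction $\alpha$ and the same
resultant length $\rho$, this means

$$ H(f) \leq -E[\kappa \cos(X - \alpha) + \log(C)] = -\kappa\rho - \log(C) = H(g) . $$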

□ 

Definition. A circle-valued random variable with probability density
function

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \sum_{k=-\infty}^{\infty}
          e^{-(x - \alpha - 2\pi k)^2/(2\sigma^2)} $$

is said to have the wrapped normal distribution. It is obtained by taking
the normal distribution and wrapping it around the circle: if $Y$ is a
normally distributed random variable with mean $\alpha$ and variance
$\sigma^2$, then $Y \mod 2\pi$ has the wrapped normal distribution with
those parameters.



Example. A circle-valued random variable with constant density is called 
a random variable with the uniform distribution. 



Example. A circle-valued random variable with values in a closed finite
subgroup $H$ of the circle is said to have a lattice distribution. For
example, the random variable which takes the value $0$ with probability
$1/2$, the value $2\pi/3$ with probability $1/4$ and the value $4\pi/3$ with
probability $1/4$ has a lattice distribution. The group $H$ is the finite
cyclic group $\mathbb{Z}_3$.

Remark. Why do we bother with new terminology and not just look at
real-valued random variables taking values in $[0, 2\pi)$? The reason to
change the language is that there is a natural addition of angles given by
rotations. Also, any modeling by vector-valued random variables is somewhat
arbitrary. A further advantage is that the characteristic function is now a
sequence and no longer a function on the real line.



Distribution      Parameter                Characteristic function

point             $x_0$                    $\phi_X(k) = e^{ikx_0}$
uniform                                    $\phi_X(k) = 0$ for $k \neq 0$ and $\phi_X(0) = 1$
Mises             $\kappa$, $\alpha = 0$   $\phi_X(k) = I_k(\kappa)/I_0(\kappa)$
wrapped normal    $\sigma$, $\alpha = 0$   $\phi_X(k) = e^{-k^2\sigma^2/2}$

The functions $I_k(\kappa)$ are modified Bessel functions of the first kind
of $k$'th order.

Definition. If $X_1, X_2, \dots$ is a sequence of circle-valued random
variables, define $S_n = X_1 + \dots + X_n$.



Theorem 5.8.4 (Central limit theorem for circle-valued random variables).
The sum $S_n$ of IID circle-valued random variables $X_i$ which do not have
a lattice distribution converges in distribution to the uniform
distribution.



Proof. We have $|\phi_X(k)| < 1$ for all $k \neq 0$ because if
$|\phi_X(k)| = 1$ for some $k \neq 0$, then $X$ has a lattice distribution.
Because $\phi_{S_n}(k) = \prod_{i=1}^n \phi_{X_i}(k)$, all Fourier
coefficients $\phi_{S_n}(k)$ converge to $0$ for $n \to \infty$ for
$k \neq 0$. □
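The decay of the Fourier coefficients can be watched on a concrete
non-lattice law. The Python sketch below assumes, purely for illustration,
that each $X_i$ is uniform on the arc $[0, 1]$ of the circle $[0, 2\pi)$, so
that $\phi_X(k) = (e^{ik} - 1)/(ik)$ for $k \neq 0$; the function names are
hypothetical.

```python
import cmath

def phi_arc(k):
    """Characteristic function of the uniform law on the arc [0, 1] of the circle."""
    if k == 0:
        return 1.0 + 0.0j
    return (cmath.exp(1j * k) - 1) / (1j * k)

def phi_sum(k, n):
    """Characteristic function of S_n = X_1 + ... + X_n for IID X_i."""
    return phi_arc(k) ** n

for n in (1, 10, 100):
    print(n, [round(abs(phi_sum(k, n)), 6) for k in range(0, 4)])
```

The coefficient at $k = 0$ stays equal to 1, while all other coefficients
shrink geometrically in $n$, which is the mechanism used in the proof.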

Remark. The IID property can be weakened. The Fourier coefficients

$$ \phi_{X_n}(k) = 1 - a_{nk} $$

should have the property that $\sum_{n=1}^{\infty} a_{nk}$ diverges for all
$k \neq 0$, because then $\prod_{n=1}^{\infty} (1 - a_{nk}) \to 0$. If $X_i$
converges in law to a lattice distribution, then there is a subsequence for
which the central limit theorem does not hold.

Remark. Every Fourier mode goes to zero exponentially. If
$|\phi_X(k)| \leq 1 - \delta$ for some $\delta > 0$ and all $k \neq 0$, then
the convergence in the central limit theorem is exponentially fast.

Remark. Naturally, the usual central limit theorem still applies if one
considers a circle-valued random variable as a random variable taking values
in $[-\pi, \pi]$. Because the classical central limit theorem shows that
$\sum_{i=1}^n X_i/\sqrt{n}$ converges weakly to a normal distribution,
$\sum_{i=1}^n X_i/\sqrt{n} \mod 2\pi$ converges to the wrapped normal
distribution. Note that such a restatement of the central limit theorem is
not natural in the context of circular random variables because it assumes
the circle to be embedded in a particular way in the real line and also
because the operation of dividing by $\sqrt{n}$ is not natural on the
circle. It uses the field structure of the cover $\mathbb{R}$.

Example. Circle-valued random variables appear as magnetic fields in
mathematical physics. Assume the plane is partitioned into squares
$[j, j+1) \times [k, k+1)$ called plaquettes. We can attach IID random
variables $B_{jk} = e^{iX_{jk}}$ to each plaquette. The total magnetic field
in a region $G$ is the product of all the magnetic fields $B_{jk}$ in the
region:

$$ \prod_{(j,k) \in G} B_{jk} = e^{i \sum_{(j,k) \in G} X_{jk}} . $$

The central limit theorem assures that the total magnetic field distribution
in a large region is close to a uniform distribution.






Example. Consider standard Brownian motion $B_t$ on the real line and its
graph $\{ (t, B_t) \mid t \in \mathbb{R} \}$ in the plane. The circle-valued
random variable $X_n = B_n \mod 1$ gives the distance of the graph at time
$t = n$ to the next lattice point below the graph. The distribution of $X_n$
is the wrapped normal distribution with parameters $m = 0$ and
$\sigma = \sqrt{n}$, since $B_n$ has variance $n$.



Figure. The graph of one-dimensional Brownian motion with a grid. The
stochastic process produces a circle-valued random variable
$X_n = B_n \mod 1$.
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If X, Y are real- valued IID random variables, then X+Y is not independent 
of $X$. Indeed $X + Y$ and $Y$ are positively correlated because

$$ Cov[X + Y, Y] = Cov[X, Y] + Cov[Y, Y] = Cov[Y, Y] = Var[Y] > 0 . $$

The situation changes for circle-valued random variables. The sum of two
independent random variables can be independent of the first random
variable. Adding a random variable with uniform distribution immediately
renders the sum uniform:



Theorem 5.8.5 (Stability of the uniform distribution). Let $X, Y$ be
circle-valued random variables. If $Y$ has the uniform distribution and
$X, Y$ are independent, then $X + Y$ is independent of $X$ and has the
uniform distribution.



Proof. We have to show that the event $A = \{ X + Y \in [c,d] \}$ is
independent of the event $B = \{ X \in [a,b] \}$. To do so we calculate
$P[A \cap B] = \int_a^b \int_{c-x}^{d-x} f_X(x) f_Y(y) \, dy \, dx$. Because
$Y$ has the uniform distribution, we get after the substitution $u = x + y$,

$$ \int_a^b \int_{c-x}^{d-x} f_X(x) f_Y(y) \, dy \, dx
   = \int_a^b \int_c^d f_X(x) f_Y(u) \, du \, dx = P[A] P[B] . $$

By looking at the characteristic function
$\phi_{X+Y}(k) = \phi_X(k) \phi_Y(k) = \phi_Y(k)$, we see that $X + Y$ has
the uniform distribution. □
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The interpretation of this lemma is that adding a uniform random noise to 
a given distribution makes it uniform.

On the $d$-dimensional torus $\mathbb{T}^d$, the uniform distribution plays
the role of the normal distribution as the following central limit theorem
shows:



Theorem 5.8.6 (Central limit theorem for circular random vectors). The
sum $S_n$ of IID circle-valued random vectors $X_i$ converges in
distribution to the uniform distribution on a closed subgroup $H$ of
$G = \mathbb{T}^d$.



Proof. Again $|\phi_X(k)| \leq 1$. Let $\Lambda$ denote the set of $k$ such
that $\phi_X(k) = 1$.

(i) $\Lambda$ is a lattice. If $\int e^{ik \cdot X(x)} \, dx = 1$, then
$e^{ik \cdot X(x)} = 1$ for almost all $x$. If $\lambda_1, \lambda_2$ are in
$\Lambda$, then $\lambda_1 + \lambda_2 \in \Lambda$.

(ii) The random variable takes values in a group $H$ which is the dual group
of $\mathbb{Z}^d/\Lambda$.

(iii) Because $\phi_{S_n}(k) = \prod_{i=1}^n \phi_{X_i}(k)$, all Fourier
coefficients $\phi_{S_n}(k)$ which are not $1$ converge to $0$.

(iv) $\phi_{S_n}(k) \to 1_\Lambda(k)$, which is the characteristic function
of the uniform distribution on $H$. □

Example. If $G = \mathbb{T}^2$ and
$\Lambda = \{ \dots, (-1,0), (0,0), (1,0), (2,0), \dots \}$, then the random
variable $X$ takes values in $H = \{ (0,y) \mid y \in \mathbb{T}^1 \}$, a
one-dimensional circle, and there is no smaller subgroup. The limiting
distribution is the uniform distribution on that circle.

Remark. If $X$ is a random variable with an absolutely continuous
distribution on $\mathbb{T}^d$, then the distribution of $S_n$ converges to
the uniform distribution on $\mathbb{T}^d$.



Exercise. Let $Y$ be a real-valued random variable which has the standard
normal distribution. Then $X(x) = Y(x) \mod 1$ is a circle-valued random
variable. If $Y_i$ are IID normally distributed random variables, then
$S_n = Y_1 + \dots + Y_n \mod 1$ are circle-valued random variables. What is
$Cov[S_n, S_m]$?

The central limit theorem applies to all compact Abelian groups. Here is 
the setup: 
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Definition. A topological group G is a group with a topology so that addi- 
tion on this group is a continuous map from G x G — > G and such that the 
inverse $x \to x^{-1}$ from $G$ to $G$ is continuous. If the group acts
transitively as transformations on a space $H$, the space $H$ is called a
homogeneous space. In this case, $H$ can be identified with $G/G_x$, where
$G_x$ is the isotropy subgroup of $G$ consisting of all elements which fix a
point $x$.

Example. Any finite group $G$ with the discrete topology, given by the
distance $d(x,y) = 1$ if $x \neq y$ and $d(x,y) = 0$ if $x = y$, is a
topological group.

Example. The real line $\mathbb{R}$ with addition or, more generally, the
Euclidean space $\mathbb{R}^d$ with addition are topological groups when the
usual Euclidean distance defines the topology.

Example. The circle $\mathbb{T}$ with addition or, more generally, the torus
$\mathbb{T}^d$ with addition is a topological group. It is an example of a
compact Abelian topological group.

Example. The general linear group $G = GL(n, \mathbb{R})$ with matrix
multiplication is a topological group if the topology is the topology
inherited as a subset of the Euclidean space $\mathbb{R}^{n^2}$ of
$n \times n$ matrices. Also subgroups of $GL(n, \mathbb{R})$, like the
special linear group $SL(n, \mathbb{R})$ of matrices with determinant $1$ or
the rotation group $SO(n, \mathbb{R})$ of orthogonal matrices with
determinant $1$, are topological groups. The rotation group has the sphere
$S^{n-1}$ as a homogeneous space.

Definition. A measurable function from a probability space
$(\Omega, \mathcal{A}, P)$ to a topological group $(G, \mathcal{B})$ with
Borel $\sigma$-algebra $\mathcal{B}$ is called a $G$-valued random variable.

Definition. The law of a $G$-valued random variable $X$ is the push-forward
measure $\mu = X_* P$ on $G$.

Example. If $(G, \mathcal{A}, P)$ is the probability space obtained by
taking a compact topological group $G$ with a group invariant distance $d$,
a Borel $\sigma$-algebra $\mathcal{A}$ and the Haar measure $P$, then
$X(x) = x$ is a group-valued random variable. The law of $X$ is called the
uniform distribution on $G$.

Definition. A measurable function from a probability space
$(\Omega, \mathcal{A}, P)$ to a homogeneous space $H$ is called an
$H$-valued random variable. In particular, if $H$ is the $d$-dimensional
sphere $(S^d, \mathcal{B})$ with Borel $\sigma$-algebra $\mathcal{B}$, then
$X$ is called a spherical random variable. It is used to describe spherical
data.



5.9 Lattice points near Brownian paths 

The following law of large numbers deals with sums S n of n random vari- 
ables, where the law of random variables depends on n. 
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Theorem 5.9.1 (Law of large numbers for random variables with shrinking 
support). Let $X_i$ be IID random variables with uniform distribution on
$[0, 1]$. Then for any $0 < \delta < 1$ and $A_n = [0, 1/n^\delta]$, we have

$$ \frac{1}{n^{1-\delta}} \sum_{k=1}^n 1_{A_n}(X_k) \to 1 $$

for $n \to \infty$ in probability. For $\delta < 1/2$, we have almost
everywhere convergence.



Proof. For fixed $n$, the random variables
$Z_k(x) = 1_{[0, 1/n^\delta]}(X_k)$ are independent, identically distributed
random variables with mean $E[Z_k] = p = 1/n^\delta$ and variance $p(1-p)$.
The sum $S_n = \sum_{k=1}^n Z_k$ has a binomial distribution with mean
$np = n^{1-\delta}$ and variance $Var[S_n] = np(1-p) = n^{1-\delta}(1-p)$.
Note that if $n$ changes, then the random variables in the sum $S_n$ change
too, so that we can not invoke the law of large numbers directly. But the
tools for the proof of the law of large numbers still work.

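For fixed $\epsilon > 0$ and $n$, the set

$$ B_n = \{ x \in [0,1] \mid | \frac{S_n(x)}{n^{1-\delta}} - 1 | > \epsilon \} $$

has by the Chebychev inequality (2.5.5) the measure

$$ P[B_n] \leq Var[\frac{S_n}{n^{1-\delta}}]/\epsilon^2
          = \frac{Var[S_n]}{n^{2-2\delta} \epsilon^2}
          = \frac{1-p}{\epsilon^2 n^{1-\delta}}
          \leq \frac{1}{\epsilon^2 n^{1-\delta}} . $$

This proves convergence in probability, and the weak law version for all
$\delta < 1$ follows.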
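A small Monte Carlo experiment illustrates the statement. The Python sketch
below fixes $\delta = 1/2$ and an arbitrary seed; the function name
`shrinking_target_ratio` and the chosen sample sizes are assumptions for
illustration only.

```python
import random

def shrinking_target_ratio(n, delta, rng):
    """Return S_n / n^(1 - delta), where S_n counts X_k in A_n = [0, n^(-delta)]."""
    threshold = n ** (-delta)
    hits = sum(1 for _ in range(n) if rng.random() <= threshold)
    return hits / n ** (1 - delta)

rng = random.Random(2006)   # fixed seed, chosen arbitrarily
ratios = {n: shrinking_target_ratio(n, 0.5, rng)
          for n in (10**3, 10**4, 10**5, 10**6)}

for n, ratio in ratios.items():
    print(n, round(ratio, 3))
```

By the Chebychev estimate above, the standard deviation of the ratio is at
most $n^{-(1-\delta)/2}$, so the printed values should settle near 1 as $n$
grows.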

In order to apply the Borel-Cantelli lemma (2.2.2), we need to take a
subsequence so that $\sum_{k=1}^{\infty} P[B_{n_k}]$ converges. In this way,
we establish complete convergence which implies almost everywhere
convergence.

Take $\kappa = 2$ with $\kappa(1 - \delta) > 1$ and define
$n_k = k^\kappa = k^2$. The event $B = \limsup_k B_{n_k}$ has measure zero.
This is the event that we are in infinitely many of the sets $B_{n_k}$.
Consequently, for large enough $k$, we are in none of the sets $B_{n_k}$: if
$x \notin B$, then

$$ | \frac{S_{n_k}(x)}{n_k^{1-\delta}} - 1 | < \epsilon $$

for large enough $k$. Therefore, for $0 \leq l \leq n_{k+1} - n_k$,

$$ | \frac{S_{n_k + l}(x)}{n_k^{1-\delta}} - 1 |
   \leq | \frac{S_{n_k}(x)}{n_k^{1-\delta}} - 1 |
      + \frac{S_l(T^{n_k}(x))}{n_k^{1-\delta}} . $$

Because for $n_k = k^2$ we have $n_{k+1} - n_k = 2k + 1$ and

$$ \frac{S_l(T^{n_k}(x))}{n_k^{1-\delta}} \leq \frac{2k+1}{k^{2(1-\delta)}} . $$

For $\delta < 1/2$, this goes to zero, assuring that we have convergence not
only of the sum along the subsequence $S_{n_k}$ but also for $S_n$ (compare
lemma (2.11.2)). We know now $| \frac{S_n}{n^{1-\delta}} - 1 | \to 0$ almost
everywhere for $n \to \infty$. □

Remark. If we sum up independent random variables
$Z_k = n^\delta 1_{[0, 1/n^\delta]}(X_k)$, where $X_k$ are IID random
variables, the moments $E[Z_k^m] = n^{(m-1)\delta}$ become infinite for
$m \geq 2$ in the limit $n \to \infty$. The laws of large numbers do not
apply because $E[Z_k^2]$ depends on $n$ and diverges for $n \to \infty$. We
also change the random variables when taking larger sums. For example, the
assumption $\sup_n \frac{1}{n} \sum_{i=1}^n Var[X_i] < \infty$ does not
apply.

Remark. We could not conclude the proof in the same way as in theorem
(2.9.3) because $U_n = \sum_{k=1}^n 1_{A_n}(X_k)$ is not monotonically
increasing in $n$. For $\delta \in [1/2, 1)$ we have only proven a weak law
of large numbers. It seems however that a strong law should work for all
$\delta < 1$.

Here is an application of this theorem in random geometry. 



Corollary 5.9.2. Assume we place randomly $n$ discs of radius
$r = 1/n^{1/2 - \delta/2}$ onto the plane. Their total area without overlap
is $\pi n r^2 = \pi n^\delta$. If $S_n$ is the number of lattice points hit
by the discs, then for $\delta < 1/2$

$$ \frac{S_n}{n^\delta} \to \pi $$

almost surely.



Figure. Throwing randomly discs onto the plane and counting the number of
lattice points which are hit. The size of the discs depends on the number of
discs on the plane. If $\delta = 1/3$ and if $n = 1'000'000$, then we have
discs of radius $1/100$ (area $\pi/10000$) and we expect $S_n$, the number
of lattice point hits, to be $100\pi$.
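The numbers for $\delta = 1/3$ and $n = 10^6$ follow directly from the
formulas $r = n^{-(1/2 - \delta/2)}$ and $\pi n r^2 = \pi n^\delta$ of the
corollary. A short Python sketch of this arithmetic, with an illustrative
function name:

```python
import math

def disc_parameters(n, delta):
    """Radius r = n^-(1/2 - delta/2) and total disc area n * pi * r^2 = pi * n^delta."""
    radius = n ** (-(0.5 - delta / 2))
    total_area = n * math.pi * radius**2
    return radius, total_area

radius, expected_hits = disc_parameters(10**6, 1 / 3)
print(radius, expected_hits)
```

Since each unit cell of the lattice has area 1, the total disc area is also
the expected number of lattice points hit, which is why it serves as the
normalization in the corollary.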








Remark. As with the Buffon needle problem mentioned in the introduction, we
can get a limit. But unlike in the Buffon needle problem, where we keep the
setup the same, independent of the number of experiments, we adapt the
experiment depending on the number of tries. If we make a large number of
experiments, we take a small radius of the disc. The case $\delta = 0$ is
the trivial case, where the radius of the disc stays the same.

The proof of theorem (5.9.1) shows that the assumption of independence
can be weakened. It is enough to have asymptotically exponentially
decorrelated random variables.

Definition. A measure preserving transformation $T$ of $[0, 1]$ has decay of
correlations for a random variable $X$ satisfying $E[X] = 0$, if

$$ Cov[X, X(T^n)] \to 0 $$

for $n \to \infty$. If

$$ Cov[X, X(T^n)] \leq e^{-Cn} $$

for some constant $C > 0$, then $X$ has exponential decay of correlations.



Lemma 5.9.3. If $B_t$ is standard Brownian motion, then the random
variables $X_n = B_n \mod 1$ have exponential decay of correlations.



Proof. $B_n$ has the normal distribution with mean $0$ and standard
deviation $\sigma = \sqrt{n}$. The random variable $X_n$ is a circle-valued
random variable with wrapped normal distribution with parameter
$\sigma = \sqrt{n}$. Its characteristic function is
$\phi_{X_n}(k) = e^{-k^2\sigma^2/2}$. We have $X_{n+m} = X_n + Y_m \mod 1$,
where $X_n$ and $Y_m$ are independent circle-valued random variables. Let

$$ g_n(x) = \sum_{k=0}^{\infty} e^{-k^2 n/2} \cos(kx) = 1 - \epsilon(x)
          \geq 1 - e^{-Cn} $$

be the density of $X_n$, which is also the density of $Y_n$. We want to know
the correlation between $X_{n+m}$ and $X_n$:

$$ \int_0^1 \int_0^1 f(x) f(x+y) g(x) g(y) \, dy \, dx . $$

With $u = x + y$, this is equal to

$$ \int_0^1 \int_0^1 f(x) f(u) (1 - \epsilon(x)) (1 - \epsilon(u - x)) \, du \, dx
   \leq C_1 |f|_\infty^2 e^{-Cn} . $$



□ 
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Proposition 5.9.4. If T : [0, 1] — > [0, 1] is a measure-preserving transfor- 
mation which has exponential decay of correlations for Xj. Then for any 
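$\delta \in [0, 1/2)$ and $A_n = [0, 1/n^\delta]$, we have

$$ \lim_{n \to \infty} \frac{1}{n^{1-\delta}} \sum_{k=1}^n 1_{A_n}(T^k(x)) = 1 . $$



Proof. The same proof works. The decorrelation assumption implies that
there exists a constant $C$ such that

$$ \sum_{i \neq j \leq n} Cov[X_i, X_j] \leq C . $$

Therefore,

$$ Var[S_n] = n Var[X_n] + \sum_{i \neq j \leq n} Cov[X_i, X_j]
           \leq n Var[X_n] + \sum_{i \neq j \leq n} e^{-C(i-j)^2} . $$

The sum converges and so $Var[S_n] \leq n Var[X_i] + C$. □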

Remark. The assumption that the probability space $\Omega$ is the interval
$[0, 1]$ is not crucial. Many probability spaces $(\Omega, \mathcal{A}, P)$,
where $\Omega$ is a compact metric space with Borel $\sigma$-algebra
$\mathcal{A}$ and $P[\{x\}] = 0$ for all $x \in \Omega$, are measure
theoretically isomorphic to $([0, 1], \mathcal{B}, dx)$, where $\mathcal{B}$
is the Borel $\sigma$-algebra on $[0, 1]$ (see [13] proposition (2.17)). The
same remark also shows that the assumption $A_n = [0, 1/n^\delta]$ is not
essential. One can take any nested sequence of sets $A_n \in \mathcal{A}$
with $P[A_n] = 1/n^\delta$ and $A_{n+1} \subset A_n$.



Figure. We can apply this propo- 
sition to a lattice point prob- 
lem near the graphs of one- 
dimensional Brownian motion, 
where we have a probability space 
of paths and where we can make 
a statement about almost every 
path in that space. This is a re- 
sult in the geometry of numbers 
for connected sets with fractal 
boundary. 
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Corollary 5.9.5. Assume B t is standard Brownian motion. For any < 5 < 
1/2, there exists a constant $C$ such that any $1/n^{1+\delta}$
neighborhood of the graph of $B$ over $[0, 1]$ contains at least
$C n^{1-\delta}$ lattice points, if the lattice has a minimal spacing
distance of $1/n$.



Proof. $B_{t+1/n} \mod 1/n$ is not independent of $B_t$, but the Poincaré
return map $T$ from time $t = k/n$ to time $(k+1)/n$ is a Markov process
from $[0, 1/n]$ to $[0, 1/n]$ with transition probabilities. The random
variables $X_i$ have exponential decay of correlations as we have seen in
lemma (5.9.3). □



Remark. A similar result can be shown for other dynamical systems with
strong recurrence properties. It holds for example for irrational rotations
$T(x) = x + \alpha \mod 1$ with Diophantine $\alpha$, while it does not hold
for Liouville $\alpha$. For any irrational $\alpha$, we have
$f_n = \frac{1}{n^{1-\delta}} \sum_{k=1}^n 1_{A_n}(T^k(x))$ near $1$ for
arbitrarily large $n = q_l$, where $p_l/q_l$ is the periodic approximation
of $\alpha$. However, if the $q_l$ are sufficiently far apart, there are
arbitrarily large $n$ where $f_n$ is bounded away from $1$, so that $f_n$
does not converge to $1$.

The theorem we have proved above belongs to the research area of geome- 
try of numbers. Mixed with probability theory it is a result in the random 
geometry of numbers. 

A prototype of many results in the geometry of numbers is Minkowski's 
theorem: 



Theorem 5.9.6 (Minkowski theorem). A convex set $M$ which is invariant
under the map $T(x) = -x$ and has area $> 4$ contains a lattice point
different from the origin.



Proof. One can translate all points of the set $M$ back to the square
$\Omega = [-1, 1] \times [-1, 1]$. Because the area is $> 4$, there are two
different points $(x, y), (u, v)$ of $M$ which have the same identification
in the square $\Omega$. But if $(x, y) = (u + 2k, v + 2l)$, then
$(x - u, y - v) = (2k, 2l)$. By point symmetry also $(a, b) = (-u, -v)$ is
in the set $M$. By convexity $((x + a)/2, (y + b)/2) = (k, l)$ is in $M$.
This is the lattice point we were looking for. □
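The theorem can be tested by brute force on a family of thin ellipses, which
are convex and symmetric about the origin. The Python sketch below is only
an illustration: the ellipse family, the search window and the function name
are assumptions, and the search is a plain enumeration rather than anything
used in the proof.

```python
import math

def nonzero_lattice_point_in_ellipse(a, b, theta):
    """Search for (k, l) != (0, 0) in the ellipse with semi-axes a, b rotated by theta."""
    bound = int(math.ceil(max(a, b)))
    c, s = math.cos(theta), math.sin(theta)
    for k in range(-bound, bound + 1):
        for l in range(-bound, bound + 1):
            if (k, l) == (0, 0):
                continue
            u = c * k + s * l       # coordinates in the ellipse's own axes
            v = -s * k + c * l
            if (u / a) ** 2 + (v / b) ** 2 <= 1:
                return (k, l)
    return None

# Thin ellipses of area pi * 20 * 0.07 > 4 in many directions.
found = [nonzero_lattice_point_in_ellipse(20, 0.07, 0.1 * j) for j in range(32)]
print(found[:5])
```

Every entry of `found` is a nonzero lattice point, however the thin ellipse
is rotated, while a disc of radius 0.9 (area below 4) contains none.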
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Figure. A convex, symmetric set 
M. For illustration purposes, the 
area has been chosen smaller 
than 4 in this picture. The theo- 
rem of Minkowski assumes, it is 
larger than 4. 



Figure. Translate all points back to the square $[-1, 1] \times [-1, 1]$ of
area 4. One obtains overlapping points. The symmetry and convexity allow one
to conclude the existence of a lattice point in $M$.




There are also open questions: 

• The Gauss circle problem asks to estimate the number of $1/n$-lattice
points $g(n) = \pi n^2 + E(n)$ enclosed in the unit disk. One believes that
an estimate $E(n) \leq C n^\theta$ holds for every $\theta > 1/2$. The
smallest $\theta$ for which one knows this is $\theta = 46/73$.

• For a smooth curve of length 1 which is not a line, we have a similar
result as for the random walk but we need $\delta < 1/3$. Is there a result
for $\delta < 1$?

• Consider Brownian motion in $\mathbb{R}^d$. How many $1/n$ lattice points
are there in a Wiener sausage, that is in a $1/n^{1+\delta}$ neighborhood of
the path?
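For the Gauss circle problem, the count $g(n)$ is the number of integer
points $(x, y)$ with $x^2 + y^2 \leq n^2$, and the error term $E(n)$ can be
tabulated for small $n$. A minimal Python sketch with an illustrative
function name:

```python
import math

def gauss_circle_count(n):
    """Number of integer points (x, y) with x^2 + y^2 <= n^2."""
    total = 0
    for x in range(-n, n + 1):
        # For fixed x, the admissible y range is -isqrt(...) .. isqrt(...).
        total += 2 * math.isqrt(n * n - x * x) + 1
    return total

for n in (10, 100, 1000):
    g = gauss_circle_count(n)
    print(n, g, g - math.pi * n * n)
```

The last printed column is $E(n) = g(n) - \pi n^2$. Such a table can only
suggest the growth of the error; it proves nothing about the exponent
$\theta$.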

5.10 Arithmetic random variables 

Because large numbers are virtually infinite (we have no possibility to
inspect all of the numbers from $\Omega_n = \{1, \dots, n = 10^{100}\}$, for
example), functions like $X_n(k) = k^2 + 5 \mod n$ are accessible on a small
subset only. The function $X_n$ behaves as a random variable on an infinite
probability space. If we could find the events $U_n = \{ X_n = 0 \}$ easily,
then factorization would be easy as the factors can be determined from the
elements of $U_n$. A finite but large probability space $\Omega_n$ can be
explored statistically, and the question is how much information we can draw
from a small number of data. It is unknown how much information we can get
from a large integer $n$ with finitely many computations. Can we
statistically recover the factors of $n$ from $O(\log(n))$ data points
$(k_j, x_j)$, where $x_j = n \mod k_j$, for example?

As an illustration of how arithmetic complexity meets randomness, we con- 
sider in this section examples of number theoretical random variables, which 
can be computed with a fixed number of arithmetic operations. Both have 
the property that they appear to be "random" for large n. These functions 
belong to a class of random variables 

$$ X(k) = p(k, n) \mod q(k, n) , $$

where $p$ and $q$ are polynomials in two variables. For these functions, the
sets $X^{-1}(a) = \{ X(k) = a \}$ are in general difficult to compute and
$Y_0(k) = X(k), Y_1(k) = X(k+1), \dots, Y_l(k) = X(k+l)$ behave very much
like independent random variables.

To deal with "number theoretical randomness", we use the notion of
asymptotic independence. Asymptotically independent random variables
approximate independent random variables in the limit $n \to \infty$. With
this notion, we can study fixed sequences or deterministic arithmetic
functions on finite probability spaces with the language of probability,
even though there is no fixed probability space on which the sequences form
a stochastic process.

Definition. A sequence of number theoretical random variables is a
collection of integer-valued random variables $X_n$ defined on finite
probability spaces $(\Omega_n, \mathcal{A}_n, P_n)$ for which
$\Omega_n \subset \Omega_{n+1}$ and $\mathcal{A}_n$ is the set of all
subsets of $\Omega_n$. An example is a sequence $X_n$ of integer-valued
functions defined on $\Omega_n = \{0, \dots, n-1\}$. If there exists a
constant $C$ such that $X_n$ on $\{0, \dots, n\}$ is computable with a total
of less than $C$ additions, multiplications, comparisons, greatest common
divisor and modular operations, we call $X$ a sequence of arithmetic random
variables.

Example. For example

$X_n(x) = (((x^5 - 7) \bmod 9)^3 x - x^2) \bmod n$

defines a sequence of arithmetic random variables on $\Omega_n = \{0, \dots, n-1\}$.
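
The example above uses only a fixed number of arithmetic operations, so it is easy to tabulate. The following is a minimal sketch, assuming Python's convention that `%` returns a non-negative remainder for a positive modulus; the function names `X` and `law` are illustrative and not part of the text.

```python
from collections import Counter

def X(n, x):
    """X_n(x) = (((x^5 - 7) mod 9)^3 * x - x^2) mod n.

    Python's % yields a remainder in {0, ..., n-1} for positive n,
    which is the convention assumed here.
    """
    return ((((x**5 - 7) % 9) ** 3) * x - x**2) % n

def law(n):
    """Empirical law of X_n on the finite space {0, ..., n-1} with uniform measure."""
    counts = Counter(X(n, x) for x in range(n))
    return {value: c / n for value, c in sorted(counts.items())}

if __name__ == "__main__":
    print([X(10, x) for x in range(10)])
    print(law(10))
```

Each evaluation costs a bounded number of multiplications, subtractions and modular reductions, independent of n, which is the defining property of an arithmetic random variable.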

Example. If $x_n$ is a fixed integer sequence, then $X_n(k) = x_k$ on $\Omega_n =
\{0, \dots, n-1\}$ is a sequence of number theoretical random variables. For
example, the digits $x_n$ of the decimal expansion of $\pi$ define a sequence
of number theoretical random variables $X_n(k) = x_k$ for $k < n$. However,
in the case of $\pi$, it is not known whether this sequence is an arithmetic
sequence. It would be a surprise if one could compute $x_n$ with a finite
n-independent number of basic operations. Also other deterministic sequences
like the decimal expansions of $\pi$, $\sqrt{2}$ or the Möbius function $\mu(n)$ appear
"random".
Remark. Unlike for discrete time stochastic processes $X_n$, where all random
variables $X_n$ are defined on a fixed probability space $(\Omega, \mathcal{A}, P)$, an
arithmetic sequence of random variables $X_n$ uses different finite probability
spaces $(\Omega_n, \mathcal{A}_n, P_n)$.

Remark. Arithmetic functions are a subset of the complexity class P of
functions computable in polynomial time. The class of arithmetic sequences
of random variables is expected to be much smaller than the class of
sequences of all number theoretical random variables. Because computing
$\gcd(x, y)$ needs less than $C(x + y)$ basic operations, we have included it
too in the definition of arithmetic random variable.



Definition. If $\lim_{n \to \infty} E[X_n]$ exists, then it is called the asymptotic
expectation of a sequence of arithmetic random variables. If $\lim_{n \to \infty} Var[X_n]$
exists, it is called the asymptotic variance. If the law of $X_n$ converges, the
limiting law is called the asymptotic law.

Example. On the probability space $\Omega_n = [1, \dots, n] \times [1, \dots, n]$, consider the
arithmetic random variables $X_d = 1_{S_d}$, where $S_d = \{ (j, k) \mid \gcd(j, k) = d \}$.

Proposition 5.10.1. The asymptotic expectation of $P_n[S_1] = E_n[X_1]$ is $6/\pi^2$.
In other words, the probability that two random integers are relatively
prime is $6/\pi^2$.

Proof. Because there is a bijection $\phi$ between $S_1$ on $[1, \dots, n]^2$ and $S_d$ on
$[1, \dots, dn]^2$ realized by $\phi(j, k) = (dj, dk)$, the two sets have the same number
of elements, so that $P_n[S_1] = |S_1|/n^2 = d^2 |S_d|/(d^2 n^2) = d^2 P_{dn}[S_d]$.
This shows that $E_n[X_d]/E_n[X_1]$ has the limit $1/d^2$ for $n \to \infty$. To
know $P[S_1]$, we note that the sets $S_d$ form a partition of $N^2$ and also when
restricted to $\Omega_n$. Because $P[S_d] = P[S_1]/d^2$, one has

$P[S_1] \cdot (1 + \frac{1}{2^2} + \frac{1}{3^2} + \cdots) = P[S_1] \frac{\pi^2}{6} = 1$

so that $P[S_1] = 6/\pi^2$. □
Figure. The probability that two random integers are relatively
prime is $6/\pi^2$. A cell (j, k) in the finite probability space
$[1, \dots, n] \times [1, \dots, n]$ is painted black if $\gcd(j, k) = 1$. The
probability that $\gcd(j, k) = 1$ is $6/\pi^2 = 0.607927\dots$ in the limit $n \to \infty$.
So, if you pick two large numbers (j, k) at random, the chance
to have no common divisor is slightly larger than to have a
common divisor.
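
The limit $6/\pi^2$ can be checked numerically by counting coprime pairs in the finite square. This is a brute-force sketch; the function name `coprime_fraction` is an illustrative choice, and the quadratic cost limits it to moderate n.

```python
from math import gcd, pi

def coprime_fraction(n):
    """Fraction of pairs (j, k) in [1, ..., n]^2 with gcd(j, k) = 1."""
    hits = sum(1 for j in range(1, n + 1)
                 for k in range(1, n + 1) if gcd(j, k) == 1)
    return hits / n**2

if __name__ == "__main__":
    for n in (10, 100, 300):
        print(n, coprime_fraction(n), 6 / pi**2)
```

The printed fractions approach $6/\pi^2$ as n grows, in line with Proposition 5.10.1.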



Exercise. Show that the asymptotic expectation of the arithmetic random
variable $X_n(x, y) = \gcd(x, y)$ on $[1, \dots, n]^2$ is infinite.



Example. A large class of arithmetic random variables is defined by

$X_n(k) = p(n, k) \bmod q(n, k)$

on $\Omega_n = \{0, \dots, n-1\}$ where p and q are not simultaneously linear
polynomials. We will look more closely at the following two examples:

1) $X_n(k) = (n^2 + c) \bmod k$

2) $X_n(k) = (k^2 + c) \bmod n$



Definition. Two sequences $X_n, Y_n$ of arithmetic random variables (where
$X_n, Y_n$ are defined on the same probability spaces $\Omega_n$) are called
uncorrelated if $Cov[X_n, Y_n] = 0$. They are called asymptotically uncorrelated, if
their asymptotic correlation is zero:

$Cov[X_n, Y_n] \to 0$

for $n \to \infty$.

Definition. Two sequences X, Y of arithmetic random variables are called
independent if for every n, the random variables $X_n, Y_n$ are independent.
Two sequences X, Y of arithmetic random variables with values in [0, n]
are called asymptotically independent, if for all intervals I, J, we have

$P[\frac{X_n}{n} \in I, \frac{Y_n}{n} \in J] - P[\frac{X_n}{n} \in I] \cdot P[\frac{Y_n}{n} \in J] \to 0$

for $n \to \infty$.
Remark. If there exist two uncorrelated sequences of arithmetic random
variables U, V such that $\|U_n - X_n\|_{L^2(\Omega_n)} \to 0$ and $\|V_n - Y_n\|_{L^2(\Omega_n)} \to 0$,
then X, Y are asymptotically uncorrelated. If the same is true for independent
sequences U, V of arithmetic random variables, then X, Y are
asymptotically independent.

Remark. If two random variables are asymptotically independent, they are 
asymptotically uncorrelated. 

Example. Two arithmetic random variables $X_n(k) = k \bmod n$ and $Y_n(k) =
(ak + b) \bmod n$ are not asymptotically independent. Let's look at the distribution
of the random vector $(X_n, Y_n)$ in an example:



Figure. The figure shows the points $(X_n(k), Y_n(k))$ for
$X_n(k) = k$, $Y_n(k) = 5k + 3$ modulo n in the case n = 2000.
There is a clear correlation between the two random variables.



Exercise. Find the correlation of X n (k) = k mod n and Y n (k) = 5k + 
3 mod n. 



Having asymptotic correlations between sequences of arithmetic random 
variables is rather exceptional. Most of the time, we observe asymptotic 
independence. Here are some examples: 

Example. Consider the two arithmetic variables $X_n(k) = k$ and

$Y_n(k) = c k^{-1} \bmod p(n)$,

where c is a constant and p(n) is the n'th prime number. The random
variables $X_n$ and $Y_n$ are asymptotically independent. Proof: by a lemma of
Merel [69, 23], the number of solutions $(x, y) \in I \times J$ of $xy = c \bmod p$ is

$\frac{|I| |J|}{p} + O(p^{1/2} \log^2(p))$.

This means that the probability that $X_n/n \in I_n, Y_n/n \in J_n$ is $|I_n| \cdot |J_n|$.
Figure. Illustration of the lemma of Merel. The picture shows the
points $\{ (k, 1/k) \bmod p \}$, where p is the 200'th prime number
p(200) = 1223.



Nonlinear polynomial arithmetic random variables lead in general to asymptotic
independence. Let's start with an experiment:



Figure. We see the points $(X_n(k), Y_n(k))$ for $X_n(k) = k$,
$Y_n(k) = k^2 + 3$ modulo n in the case n = 2001. Even though there are
narrow regions in which some correlations are visible, these
regions become smaller and smaller for $n \to \infty$. Indeed, we
will show that $X_n, Y_n$ are asymptotically independent random
variables.
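
The contrast between the linear and the quadratic case can be seen already at the level of correlations. The sketch below computes the Pearson correlation of the pair $(k, (5k+3) \bmod n)$ and of the pair $(k, (k^2+3) \bmod n)$; the helper names are illustrative. A small correlation is only a necessary sign of asymptotic independence, not a proof of it.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation of two equally long lists (uniform weights)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
    vx = sum((x - mx) ** 2 for x in xs) / m
    vy = sum((y - my) ** 2 for y in ys) / m
    return cov / sqrt(vx * vy)

def linear_pair(n, a=5, b=3):
    """X_n(k) = k and Y_n(k) = (a k + b) mod n on {0, ..., n-1}."""
    ks = list(range(n))
    return ks, [(a * k + b) % n for k in ks]

def quadratic_pair(n, c=3):
    """X_n(k) = k and Y_n(k) = (k^2 + c) mod n on {0, ..., n-1}."""
    ks = list(range(n))
    return ks, [(k * k + c) % n for k in ks]

if __name__ == "__main__":
    print("linear   :", correlation(*linear_pair(2000)))
    print("quadratic:", correlation(*quadratic_pair(2001)))
```

The linear pair keeps a correlation of about 1/5 (a sawtooth with five teeth), while the quadratic pair is nearly uncorrelated.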



The random variable $X_n(k) = (n^2 + c) \bmod k$ on $\{1, \dots, n\}$ is equivalent
to $X_n(k) = n \bmod k$ on $\{0, \dots, [\sqrt{n - c}]\}$, where $[x]$ is the integer part of
x. After the rescaling, the sequence of random variables is easier to analyze.

To study the distribution of the arithmetic random variable $X_n$, we can
also rescale the image, so that the range is in the interval [0, 1]. The random
variable $Y_n(x) = X_n(x \cdot |\Omega_n|)$ can be extended from the discrete set $\{ k/|\Omega_n| \}$
to the interval [0, 1]. Therefore, instead of $(n^2 + c) \bmod k$, we look at

$X_n(k) = \frac{n \bmod k}{k} = \frac{n}{k} - [\frac{n}{k}]$

on $\Omega_{m(n)} = \{1, \dots, m(n)\}$, where $m(n) = \sqrt{n - c}$.

Elements in the set $X^{-1}(0)$ are the integer factors of n. Because factoring is
a well studied NP type problem, the multi-valued function $X^{-1}$ is probably
hard to compute in general because if we could compute it fast, we could
factor integers fast.
Proposition 5.10.2. The rescaled arithmetic random variables

$X_n(k) = \frac{n \bmod k}{k} = \frac{n}{k} - [\frac{n}{k}]$

converge in law to the uniform distribution on [0, 1].



Proof. The functions $f_n^r(k) = n/(k+r) - [n/(k+r)]$ are piecewise continuous
circle maps on [0, 1]. When rescaling the argument $[0, \dots, n]$, the slope of
the graph becomes larger and larger for $n \to \infty$. We can use lemma (5.10.3)
below. □
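
A quick numerical experiment illustrates the proposition. The sketch below tabulates $(n \bmod k)/k$ for k well below $\sqrt{n}$, where the slope of the circle map is large; the choice $n = 10^9 + 7$, the cutoff 1000 and the function names are assumptions made for this illustration only.

```python
def rescaled_values(n, kmax):
    """Values (n mod k)/k = n/k - [n/k] for k = 1, ..., kmax."""
    return [(n % k) / k for k in range(1, kmax + 1)]

def histogram(values, bins=10):
    """Relative frequencies of values in [0, 1) over equally wide bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

if __name__ == "__main__":
    vals = rescaled_values(10**9 + 7, 1000)
    print("mean:", sum(vals) / len(vals))
    print("histogram:", histogram(vals))
```

For a distribution close to uniform, the mean should be near 1/2 and each of the ten bins should carry a weight near 1/10.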



Figure. Data points $(k, \frac{n \bmod k}{k})$ for n = 10'000 and $1 \le k \le n$.
For smaller values of k, the data points appear random. The
points are located on the graph of the circle map

$f_n(t) = \frac{n}{t} - [\frac{n}{t}]$.

To show the asymptotic independence of X n with any of its translations, 
we restrict the random vectors to [1, l/n a ] with a < 1. 



Lemma 5.10.3. Let $f_n$ be a sequence of smooth maps from [0, 1] to the circle
$T^1 = R/Z$ for which $(f_n^{-1})''(x) \to 0$ uniformly on [0, 1]. Then the law $\mu_n$ of
the random variables $X_n(x) = (x, f_n(x))$ converges weakly to the Lebesgue
measure $\mu = dx dy$ on $[0, 1] \times T^1$.



Proof. Fix an interval [a, b] in [0, 1]. Because $\mu_n([a, b] \times T^1)$ is the Lebesgue
measure of $\{ (x, y) \mid X_n(x, y) \in [a, b] \}$ which is equal to b − a, we only need
to compare

$\mu_n([a, b] \times [c, c + dy])$

and

$\mu_n([a, b] \times [d, d + dy])$

in the limit $n \to \infty$. But $\mu_n([a, b] \times [c, c + dy]) - \mu_n([a, b] \times [d, d + dy])$ is
bounded above by a quantity which goes to zero by assumption.



Figure. Proof of the lemma. The measure $\mu_n$ with support on the
graph of $f_n(x)$ converges to the Lebesgue measure on the product
space $[0, 1] \times T^1$. The condition $f''/f'^2 \to 0$ assures that
the distribution in the y direction smooths it out.




Theorem 5.10.4. Let c be a fixed integer and $X_n(k) = (n^2 + c) \bmod k$
on $\{1, \dots, n\}$. For every integer r > 0 and 0 < a < 1, the random variables
X(k), Y(k) = X(k + r) are asymptotically independent and uncorrelated
on $[0, n^a]$.



Proof. We have to show that the normalized discrete measures
$\frac{1}{n^a} \sum_{k=1}^{n^a} \delta(X(k), Y(k))$ converge weakly to the Lebesgue measure
on the torus. To do so, we first look at the measure $\mu_n = \int_0^{n^a} \delta(X(k), Y(k)) \, dk$
which is supported on the curve $t \mapsto (X(t), Y(t))$, where $k \in [0, n^a]$ with a < 1,
and show that it converges weakly to the Lebesgue measure. When rescaled,
this curve is the graph of the circle map $f_n(x) = 1/x \bmod 1$. The result
follows from lemma (5.10.3). □



Remark. Similarly, we could show that the random vectors $(X(k), X(k +
r_1), X(k + r_2), \dots, X(k + r_l))$ are asymptotically independent.

Remark. Polynomial maps like $T(x) = x^2 + c$ are used as pseudo random
number generators, for example in the Pollard $\rho$ method for factorization
[87]. In that case, one considers the random variables on $\{0, \dots, n-1\}$
defined by $X_0(k) = k$, $X_{n+1}(k) = T(X_n(k))$. Already one polynomial map
produces randomness asymptotically as $n \to \infty$.
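
To make the remark concrete, here is a minimal sketch of the Pollard $\rho$ idea with Floyd cycle detection, using the map $T(x) = x^2 + c \bmod n$. The starting value, the constant c = 1 and the step limit are illustrative assumptions; the method is heuristic and may fail for some choices, in which case the sketch returns `None`.

```python
from math import gcd

def pollard_rho(n, c=1, x0=2, max_steps=100_000):
    """Try to find a nontrivial factor of n by iterating T(x) = x^2 + c mod n.

    Returns a factor d with 1 < d < n, or None if this (c, x0) fails.
    """
    def T(x):
        return (x * x + c) % n

    x = y = x0
    for _ in range(max_steps):
        x = T(x)          # tortoise: one step
        y = T(T(y))       # hare: two steps
        d = gcd(abs(x - y), n)
        if 1 < d < n:
            return d
        if d == n:        # cycle closed without revealing a factor
            return None
    return None

if __name__ == "__main__":
    print(pollard_rho(8051))   # 8051 = 83 * 97
```

The method works because the orbit of T behaves like a random sequence modulo each prime factor of n, so collisions modulo a small factor appear after roughly the square root of that factor many steps.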
Theorem 5.10.5. If p is a polynomial of degree $d \ge 2$, then the distribution
of $Y(k) = p(k) \bmod n$ is asymptotically uniform. The random variables
$X(k) = k$ and $Y(k) = p(k) \bmod n$ are asymptotically independent and
uncorrelated.



Proof. The map can be extended to a map on the interval [0, n]. The graph 
(x, T(x)) in {1, . . . , n} x {1, . . . , n} has a large slope on most of the square. 
Again use lemma (5.10.3) for the circle maps f n (x) = p(nx) mod n on 
[0,1]. □ 



Figure. The slope of the graph of $p(x) \bmod n$ becomes larger
and larger as $n \to \infty$. Choosing an integer $k \in [0, n]$ produces
essentially a random value $p(k) \bmod n$. To prove the asymptotic
independence, one has to verify that in the limit, the push
forward of the Lebesgue measure on [0, n] under the map $f(x) =
(x, p(x)) \bmod n$ converges in law to the Lebesgue measure on
$[0, n]^2$.




Remark. Also here, we deal with random variables which are difficult to
invert: if one could find $Y^{-1}(c)$ in $O(P(\log(n)))$ time steps, then factorization
would be in the complexity class P of tasks which can be computed
in polynomial time. The reason that taking square roots modulo n is at
least as hard as factoring is the following: if we could find two square roots
x, y of a number modulo n with $x \neq \pm y \bmod n$, then $x^2 = y^2 \bmod n$. This
would lead to a factor $\gcd(x - y, n)$ of n. This fact had already been known
to Fermat. If factorization was an NP complete problem, then inverting those
maps would be hard.

Remark. The Möbius function is a function on the positive integers defined
as follows: the value of $\mu(n)$ is defined as 0, if n has a factor $p^2$ with a prime p
and is $(-1)^k$, if it contains k distinct prime factors. For example, $\mu(14) = 1$
and $\mu(18) = 0$ and $\mu(30) = -1$. With $M(n) = \mu(1) + \cdots + \mu(n)$, the Mertens
conjecture claimed that

$|M(n)| = |\mu(1) + \cdots + \mu(n)| \le C \sqrt{n}$

for some constant C. It is now believed that $M(n)/\sqrt{n}$ is unbounded but it
is hard to explore this numerically, because the $\sqrt{\log\log(n)}$ bound in the
law of the iterated logarithm is small for the integers n we are able to compute.
For example, for $n = 10^{100}$, $\sqrt{\log\log(n)}$ is less than 8/3. The fact

$\frac{M(n)}{n} = \frac{1}{n} \sum_{k=1}^{n} \mu(k) \to 0$

is known to be equivalent to the prime number theorem. It is also known
that $\limsup M(n)/\sqrt{n} \ge 1.06$ and $\liminf M(n)/\sqrt{n} \le -1.009$.
If one restricts the function $\mu$ to the finite probability spaces $\Omega_n$ of all
numbers $\le n$ which have no repeated prime factors, one obtains a sequence
of number theoretical random variables $X_n$, which take values in $\{-1, 1\}$.
Is this sequence asymptotically independent? Is the sequence $\mu(n)$ random
enough so that the law of the iterated logarithm

$\limsup_{n \to \infty} \sum_{k=1}^{n} \frac{\mu(k)}{\sqrt{2 n \log\log(n)}} \le 1$

holds? Nobody knows. The question is probably very hard, because if it
were true, one would have

$M(n) \le n^{1/2+\epsilon}$, for all $\epsilon > 0$

which is called the modified Mertens conjecture. This conjecture is known
to be equivalent to the Riemann hypothesis, probably the most notorious
unsolved problem in mathematics. In any case, the connection with
the Möbius function produces a convenient way to formulate the Riemann
hypothesis to non-mathematicians (see for example [14]). Actually,
the question about the randomness of $\mu(n)$ appeared in classic probability
text books like Feller's. Why would the law of the iterated logarithm for
the Möbius function imply the Riemann hypothesis? Here is a sketch of
the argument: the Euler product formula, sometimes referred to as "the
Golden key", says

$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} = \prod_{p \text{ prime}} (1 - \frac{1}{p^s})^{-1}$.

The function $\zeta(s)$ in the above formula is called the Riemann zeta function.
With $M(n) \le n^{1/2+\epsilon}$, one can conclude from the formula

$\frac{1}{\zeta(s)} = \sum_{n=1}^{\infty} \frac{\mu(n)}{n^s}$

that $1/\zeta(s)$ could be extended analytically from Re(s) > 1 to any of the
half planes Re(s) > 1/2 + $\epsilon$. This would prevent roots of $\zeta(s)$ to be to the
right of the axis Re(s) = 1/2. By a result of Riemann, the function $\Lambda(s) =
\pi^{-s/2} \Gamma(s/2) \zeta(s)$ is a meromorphic function with a simple pole at s = 1 and
satisfies the functional equation $\Lambda(s) = \Lambda(1 - s)$. This would imply that
$\zeta(s)$ has also no nontrivial zeros to the left of the axis Re(s) = 1/2 and
that the Riemann hypothesis were proven. The upshot is that the Riemann
hypothesis could have aspects which are rooted in probability theory.
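
The Möbius function and the partial sums M(n) are easy to explore for small n. The following sketch uses plain trial division, which is an illustrative choice and far too slow for the ranges where the Mertens conjecture fails; the function names are not from the text.

```python
from math import sqrt

def mobius(n):
    """Moebius function: 0 if n has a squared prime factor, else (-1)^k
    where k is the number of distinct prime factors of n."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:      # p^2 divides the original n
                return 0
            result = -result
        p += 1
    if n > 1:                   # one remaining prime factor
        result = -result
    return result

def mertens(n):
    """M(n) = mu(1) + ... + mu(n), without absolute value."""
    return sum(mobius(k) for k in range(1, n + 1))

if __name__ == "__main__":
    print([mobius(k) for k in (14, 18, 30)])
    for n in (10, 100, 1000):
        print(n, mertens(n), mertens(n) / sqrt(n))
```

In such small ranges the ratio $M(n)/\sqrt{n}$ stays well inside [−1, 1], which illustrates why numerical exploration alone says little about its growth.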



Figure. The sequence $X_k = \mu(l(k))$, where l(k) is the index of the k'th
nonzero entry in the sequence $\{\mu(1), \mu(2), \mu(3), \dots\}$, produces a
"random walk" $S_n = \sum_{k=1}^{n} X_k$. While $X_k$ is a deterministic
sequence, the behavior of $S_n$ resembles a typical random walk.
If that were true and the law of the iterated logarithm would hold,
this would imply the Riemann hypothesis.



5.11 Symmetric Diophantine Equations 

Definition. A Diophantine equation is an equation $f(x_1, \dots, x_k) = 0$, where
f is a polynomial in k integer variables $x_1, \dots, x_k$ and where the polynomial
f has integer coefficients. The Diophantine equation has degree m if the
polynomial has degree m. The Diophantine equation is homogeneous, if
every summand in the polynomial has the same degree. A homogeneous
Diophantine equation is also called a form.

Example. The quadratic equation $x^2 + y^2 - z^2 = 0$ is a homogeneous
Diophantine equation of degree 2. It has many solutions. They are called
Pythagorean triples. One can parameterize them all with two parameters
s, t with $x = 2st$, $y = s^2 - t^2$, $z = s^2 + t^2$, as has been known since antiquity
already [15].
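
The parameterization can be checked directly. This sketch generates triples from pairs s > t > 0; the function names are illustrative. Non-primitive triples also arise as integer multiples of these.

```python
def pythagorean_triple(s, t):
    """Return (x, y, z) = (2st, s^2 - t^2, s^2 + t^2)."""
    return 2 * s * t, s * s - t * t, s * s + t * t

def triples(smax):
    """All triples obtained from parameters 0 < t < s <= smax."""
    return [pythagorean_triple(s, t)
            for s in range(2, smax + 1) for t in range(1, s)]

if __name__ == "__main__":
    for x, y, z in triples(4):
        print(x, y, z, x * x + y * y == z * z)
```

The identity $(2st)^2 + (s^2 - t^2)^2 = (s^2 + t^2)^2$ holds for all integers s, t, so every printed line ends with True.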

Definition. A Diophantine equation of the form

$p(x_1, \dots, x_k) = p(y_1, \dots, y_k)$

is called a symmetric Diophantine equation. More generally, a Diophantine
equation

$\sum_{i=1}^{k} x_i^m = \sum_{j=1}^{l} y_j^m$

is called an Euler Diophantine equation of type (k, l) and degree m. It is a
symmetric Diophantine equation if k = l. [29, 36, 15, 4, 5]

Remark. An Euler Diophantine equation is equivalent to a symmetric
Diophantine equation if m is odd and k + l is even.
Definition. A solution $(x_1, \dots, x_k), (y_1, \dots, y_k)$ to a symmetric Diophantine
equation p(x) = p(y) is called nontrivial, if $\{x_1, \dots, x_k\}$ and $\{y_1, \dots, y_k\}$
are different sets. For example, $5^3 + 7^3 + 3^3 = 3^3 + 7^3 + 5^3$ is a trivial
solution of p(x) = p(y) with $p(x, y, z) = x^3 + y^3 + z^3$.
The following theorem was proved in [70]:



Theorem 5.11.1 (Jaroslaw Wroblewski 2002). For k > m, the Diophantine
equation $x_1^m + \cdots + x_k^m = y_1^m + \cdots + y_k^m$ has infinitely many nontrivial
solutions.



Proof. Let R be a collection of different integer multi-sets in the finite
set $[0, \dots, n]^k$. It contains at least $n^k/k!$ elements. The values $p(x) =
x_1^m + \cdots + x_k^m$ with $x \in R$ are integers in $[0, k n^m]$, and there are at least
$n^k/k!$ of them when counted with multiplicity. By the pigeon hole principle,
there are different multi-sets x, y for which p(x) = p(y) as soon as
$n^k/k! > k n^m + 1$. This is the case if $n^{k-m} > k! (k+1)$, which holds for
large n because k > m. □
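
The pigeon hole argument can be turned into a brute-force search: sort all multi-sets into slots by the value of p and report slots hit more than once. The sketch below does this for sums of m'th powers; the function name `collisions` is an illustrative choice, and the search is only practical for small k and n.

```python
from itertools import combinations_with_replacement
from collections import defaultdict

def collisions(k, m, n):
    """Map each value x_1^m + ... + x_k^m that is attained by more than one
    multi-set {x_1, ..., x_k} with entries in [0, n] to the list of those
    multi-sets (given as sorted tuples)."""
    slots = defaultdict(list)
    for x in combinations_with_replacement(range(n + 1), k):
        slots[sum(v**m for v in x)].append(x)
    return {value: xs for value, xs in slots.items() if len(xs) > 1}

if __name__ == "__main__":
    print(collisions(3, 2, 3))    # k > m: collisions appear immediately
    print(collisions(2, 3, 12))   # k < m: only the taxi-cab number 1729
```

For k > m collisions are forced once n is large, while for k < m, as in the second call, they are rare and not guaranteed by counting alone.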

The proof generalizes to the case, where p is an arbitrary polynomial of
degree m with integer coefficients in the variables $x_1, \dots, x_k$.



Theorem 5.11.2. For an arbitrary polynomial p in k variables of degree
m, the Diophantine equation p(x) = p(y) has infinitely many nontrivial
solutions.



Remark. Already small deviations from the symmetric case lead to local
constraints: for example, 2p(x) = 2p(y) + 1 has no solution for any nonzero
polynomial p in k variables because there are no solutions modulo 2.

Remark. It has been realized by Jean-Charles Meyrignac that the proof
also gives nontrivial solutions to simultaneous equations like p(x) = p(y) =
p(z) etc., again by the pigeon hole principle: there are some slots where more
than 2 values hit. Hardy and Wright [29] (theorem 412) prove that in the
case k = 2, m = 3, for every r there are numbers which are representable
as sums of two positive cubes in at least r different ways. No solutions
of $x_1^4 + y_1^4 = x_2^4 + y_2^4 = x_3^4 + y_3^4$ were known to those authors [29], nor
whether there are infinitely many solutions for general (k, m) = (2, m).
Mahler proved that $x^3 + y^3 + z^3 = 1$ has infinitely many solutions. It is
believed that $x^3 + y^3 + z^3 + w^3 = n$ has solutions for all n. For (k, m) = (2, 3),
multiple solutions lead to so called taxi-cab or Hardy-Ramanujan numbers.

Remark. For general polynomials, the degree and number of variables alone
do not decide about the existence of nontrivial solutions of $p(x_1, \dots, x_k) =
p(y_1, \dots, y_k)$. There are symmetric irreducible homogeneous equations with
k < m/2 for which one has a nontrivial solution. An example is $p(x, y) =
x^5 - 4y^5$ which has the nontrivial solution p(1, 3) = p(4, 5).

Definition. The law of a symmetric Diophantine equation $p(x_1, \dots, x_k) =
p(y_1, \dots, y_k)$ with domain $\Omega = [0, \dots, n]^k$ is the law of the random variable
p defined on the finite probability space $\Omega$.



Remark. Wroblewski's theorem holds because the random variable has an
average density which is larger than the lattice spacing of the integers. So,
there have to be different integers, which match. The continuum analog is
that if a random variable X on a domain $\Omega$ takes values in [a, b] and b − a
is smaller than the area of $\Omega$, then the density $f_X$ is larger than 1 at some
point.

Remark. Wroblewski's theorem covers cases like $x^2 + y^2 + z^2 = u^2 + v^2 + w^2$
or $x^3 + y^3 + z^3 + w^3 = a^3 + b^3 + c^3 + d^3$. It is believed that for k > m/2,
there are infinitely many solutions and no solution for k < m/2 [61].

Remark. For homogeneous Diophantine equations, it is enough to find a
single nontrivial solution $(x_1, \dots, x_k)$ to obtain infinitely many. The reason
is that $(m x_1, \dots, m x_k)$ is a solution too, for any $m \neq 0$.
Here are examples of solutions. Sources are [71, 36, 15]:

k = 2, m = 4: $(59, 158)^4 = (133, 134)^4$ (Euler, gave algebraic solutions in 1772 and 1778)
k = 2, m = 5: open problem ([36]); all sums $\le 1.02 \cdot 10^{26}$ have been tested
k = 3, m = 5: $(3, 54, 62)^5 = (24, 28, 67)^5$ ([61], two parametric solutions by Moessner 1939, Swinnerton-Dyer)
k = 3, m = 6: $(3, 19, 22)^6 = (10, 15, 23)^6$ ([29], Subba Rao, Bremner and Brudno parametric solutions)
k = 3, m = 7: open problem?
k = 4, m = 7: $(10, 14, 123, 149)^7 = (15, 90, 129, 146)^7$ (Ekl)
k = 4, m = 8: open problem?
k = 5, m = 7: $(8, 8, 13, 16, 19)^7 = (2, 12, 15, 17, 18)^7$ ([61])
k = 5, m = 8: $(1, 10, 11, 20, 43)^8 = (5, 28, 32, 35, 41)^8$
k = 5, m = 9: $(192, 101, 91, 30, 26)^9 = (180, 175, 116, 17, 12)^9$ (Randy Ekl, 1997)
k = 5, m = 10: open problem
k = 6, m = 3: $(3, 19, 22)^6 = (10, 15, 23)^6$ (Subba Rao [61])
k = 6, m = 10: $(95, 71, 32, 28, 25, 16)^{10} = (92, 85, 34, 34, 23, 5)^{10}$ (Randy Ekl, 1997)
k = 6, m = 11: open problem?
k = 7, m = 10: $(1, 8, 31, 32, 55, 61, 68)^{10} = (17, 20, 23, 44, 49, 64, 67)^{10}$ ([61])
k = 7, m = 12: $(99, 77, 74, 73, 73, 54, 30)^{12} = (95, 89, 88, 48, 42, 37, 3)^{12}$ (Greg Childers, 2000)
k = 7, m = 13: open problem?
k = 8, m = 11: $(67, 52, 51, 51, 39, 38, 35, 27)^{11} = (66, 60, 47, 36, 32, 30, 16, 7)^{11}$ (Nuutti Kuosa, 1999)
k = 20, m = 21: $(76, 74, 74, 64, 58, 50, 50, 48, 48, 45, 41, 32, 21, 20, 10, 9, 8, 6, 4, 4)^{21}$
    $= (77, 73, 70, 70, 67, 56, 47, 46, 38, 35, 29, 28, 25, 23, 16, 14, 11, 11, 3, 3)^{21}$ (Greg Childers, 2000)
k = 22, m = 22: $(85, 79, 78, 72, 68, 63, 61, 61, 60, 55, 43, 42, 41, 38, 36, 34, 30, 28, 24, 12, 11, 11)^{22}$
    $= (83, 82, 77, 77, 76, 71, 66, 65, 65, 58, 58, 54, 54, 51, 49, 48, 47, 26, 17, 14, 8, 6)^{22}$ (Greg Childers, 2000)
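
Entries of this kind are easy to verify with exact integer arithmetic. The sketch below checks three of the listed identities; the helper name `power_sum` and the selection of entries are illustrative.

```python
def power_sum(xs, m):
    """Sum of m'th powers x_1^m + ... + x_k^m, computed with exact integers."""
    return sum(x**m for x in xs)

# (degree m, left multi-set, right multi-set), taken from the list above
claims = [
    (4, (59, 158), (133, 134)),
    (5, (3, 54, 62), (24, 28, 67)),
    (6, (3, 19, 22), (10, 15, 23)),
]

if __name__ == "__main__":
    for m, left, right in claims:
        print(m, left, right, power_sum(left, m) == power_sum(right, m))
```

Because Python integers have arbitrary precision, the same check applies unchanged to the larger entries of degree 21 and 22.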
Figure. Known cases of (k, m) with nontrivial solutions x, y
of symmetric Diophantine equations g(x) = g(y) with $g(x) =
x_1^m + \cdots + x_k^m$. Wroblewski's theorem assures that for k > m, there
are solutions. The points above the diagonal beat Wroblewski's
theorem. The steep line m = 2k is believed to be the threshold
for the existence of nontrivial solutions. Above this line, there
should be no solutions, below, there should be nontrivial solutions.




What happens in the case k = m? There is no general result known. The
problem has a probabilistic flavor because one can look at the distribution
of random variables in the limit $n \to \infty$:



Lemma 5.11.3. Given a polynomial $p(x_1, \dots, x_k)$ with integer coefficients
of degree k. The random variables

$X_n(x_1, \dots, x_k) = p(x_1, \dots, x_k)/n^k$

on the finite probability spaces $\Omega_n = [0, \dots, n]^k$ converge in law to the
random variable $X(x_1, \dots, x_k) = p(x_1, \dots, x_k)$ on the probability space
$([0, 1]^k, \mathcal{B}, P)$, where $\mathcal{B}$ is the Borel $\sigma$-algebra and P is the Lebesgue
measure.



Proof. Let $S_{a,b}(n)$ be the number of points $(x_1, \dots, x_k)$ satisfying

$p(x_1, \dots, x_k) \in [n^k a, n^k b]$.

This means

$\frac{S_{a,b}(n)}{n^k} = F_n(b) - F_n(a)$,

where $F_n$ is the distribution function of $X_n$. The result follows from the fact
that $F_n(b) - F_n(a) = S_{a,b}(n)/n^k$ is a Riemann sum approximation of the
integral $F(b) - F(a) = \int_{A_{a,b}} 1 \, dx$, where $A_{a,b} = \{ x \in [0, 1]^k \mid X(x_1, \dots, x_k) \in
(a, b) \}$. □

Definition. Let's call the limiting distribution the distribution of the
symmetric Diophantine equation. By the lemma, it is clearly a piecewise smooth
function.
Example. For k = 1, we have $F(s) = P[X(x) \le s] = P[x^m \le s] = s^{1/m}$.
The distributions for k = 2 for $p(x, y) = x^2 + y^2$ and $p(x, y) = x^2 - y^2$
were plotted in the first part of these notes. The distribution function of
$p(x_1, x_2, \dots, x_k)$ is a k'th convolution product $F^{\star k} = F \star \cdots \star F$, where
$F(s) = O(s^{1/m})$ near s = 0. The asymptotic distribution of $p(x, y) = x^2 + y^2$
is bounded for all m. The asymptotic distribution of $p(x, y) = x^2 - y^2$
is unbounded near s = 0. Proof. We have to understand the laws of the
random variables $X(x, y) = x^2 + y^2$ on $[0, 1]^2$. We can see geometrically that
$(\pi/4) s^2 \le F_X(s) \le s^2$. The density is bounded. For $Y(x, y) = x^2 - y^2$, we
use polar coordinates $F(s) = \{ (r, \theta) \mid r^2 \cos(2\theta)/2 \le s \}$. Integration shows
that $F(s) = C s^2 + f(s)$, where f(s) grows logarithmically as $-\log(s)$. For
m > 2, the area $x^m - y^m \le s$ is piecewise differentiable and the derivative
stays bounded.

Remark. Let p be a polynomial of k variables of degree k. If the density
$f = F'$ of the asymptotic distribution is unbounded, then there are
solutions to the symmetric Diophantine equation p(x) = p(y).



Corollary 5.11.4. (Generalized Wroblewski) Wroblewski's result extends to 
polynomials p of degree k for which at least one variable appears in a term 
of degree smaller than k. 



Proof. We can assume without loss of generality that the first variable
is the one with a smaller degree m. If the variable $x_1$ appears only in
terms of degree k − 1 or smaller, then the polynomial p maps the finite
space $[0, n^{k/m}] \times [0, n]^{k-1}$ with $n^{k + k/m - 1} = n^{k + \epsilon}$ elements into the interval
$[\min(p), \max(p)] \subset [-C n^k, C n^k]$. Apply the pigeon hole principle. □

Example. Let us illustrate this in the case $p(x, y, z, w) = x^4 + x^3 + z^4 + w^4$.
Consider the finite probability space $\Omega_n = [0, n] \times [0, n] \times [0, n^{4/3}] \times [0, n]$
with $n^{4 + 1/3}$ elements. The polynomial maps $\Omega_n$ to the interval $[0, 4n^4]$. The pigeon
hole principle shows that there are matches.



Theorem 5.11.5. If the density $f_p$ of the random variable p on a surface
$\Omega \subset [0, n]^k$ is larger than k!, then there are nontrivial solutions to p(x) =
p(y).



In general, we try to find a subset $\Omega \subset [0, n]^k \subset R^k$ which contains $n^{k - \beta}$
points and which is mapped by X into $[0, n^{m - \alpha}]$. This includes surfaces,
subsets or points, where the density of X is large. To decide about this, we
definitely have to know the density of X on subsets. This works often
because the polynomials p modulo some integer number L do not cover all
the residue classes. Much of the research in this part of Diophantine
equations is devoted to finding such subsets and hopefully parameterizing all of
the solutions.




Figure. $X(x, y, z) = x^3 + y^3 + z^3$



Figure. X(x, y, z) = x 3 + y 3 



Exercise. Show that there are infinitely many integers which can be written
in nontrivially different ways as $x^4 + y^4 + z^4 - w^2$.



Remark. Here is a heuristic argument for the "rule of thumb" that the Euler
Diophantine equation $x_1^m + \cdots + x_k^m = x_0^m$ has infinitely many solutions for
$k \ge m$ and no solutions if k < m.

For given n, the finite probability space $\Omega = \{ (x_1, \dots, x_k) \mid 0 \le x_i \le n^{1/m} \}$
contains $n^{k/m}$ different vectors $x = (x_1, \dots, x_k)$. Define the random variable

$X(x) = (x_1^m + \cdots + x_k^m)^{1/m}$.

We expect that X takes values $1/n^{k/m} = n^{-k/m}$ close to an integer for large
n because $Y(x) = X(x) \bmod 1$ is expected to be uniformly distributed on
the interval [0, 1) as $n \to \infty$.

How close do two values Y(x), Y(y) have to be, so that Y(x) = Y(y)?
Assume $Y(x) = Y(y) + \epsilon$. Then

$X(x)^m = X(y)^m + \epsilon X(y)^{m-1} + O(\epsilon^2)$

with integers $X(x)^m, X(y)^m$. If $X(y)^{m-1} \epsilon < 1$, then it must be zero so that
Y(x) = Y(y). With the expected $\epsilon = n^{-k/m}$ and $X(y)^{m-1} \le C n^{(m-1)/m}$ we
see we should have solutions if k > m − 1 and none for k < m − 1. Cases
like m = 3, k = 2, the Fermat Diophantine equation

$x^3 + y^3 = z^3$

are tagged as threshold cases by this reasoning.

This argument has still to be made rigorous by showing that the distribution
of the points $f(x) \bmod 1$ is uniform enough, which amounts to
understanding a dynamical system with multidimensional time. We see
nevertheless that probabilistic thinking can help to bring order into the zoo
of Diophantine equations. Here are some known solutions, some written in
the Lander notation

$(x_1, \dots, x_k)^m = x_1^m + \cdots + x_k^m$.





m = 2, k = 2: $x^2 + y^2 = z^2$. Pythagorean triples like $3^2 + 4^2 = 5^2$ (1900 BC).
m = 3, k = 2: $x^3 + y^3 = z^3$ impossible, by Fermat's theorem.
m = 3, k = 3: $x^3 + y^3 + u^3 = v^3$ derived from taxicab numbers, like $10^3 + 9^3 = 1^3 + 12^3$ (Viète 1591).
m = 4, k = 3: $2682440^4 + 15365639^4 + 18796760^4 = 20615673^4$ (Elkies 1988 [24]).
m = 5, k = 3: like $x^5 + y^5 + z^5 = w^5$ is open.
m = 4, k = 4: $30^4 + 120^4 + 272^4 + 315^4 = 353^4$ (R. Norrie 1911 [36]).
m = 5, k = 4: $27^5 + 84^5 + 110^5 + 133^5 = 144^5$ (Lander, Parkin 1967).
m = 6, k = 5: $x^6 + y^6 + z^6 + u^6 + v^6 = w^6$ is open.
m = 6, k = 6: $(74, 234, 402, 474, 702, 894, 1077)^6 = 1141^6$.
m = 7, k = 7: $(525, 439, 430, 413, 266, 258, 127)^7 = 568^7$ (Mark Dodrill, 1999).
m = 8, k = 8: $(1324, 1190, 1088, 748, 524, 478, 223, 90)^8 = 1409^8$ (Scott Chase).
m = 9, k = 12: $(91, 91, 89, 71, 68, 65, 43, 42, 19, 16, 13, 5)^9 = 103^9$ (Jean-Charles Meyrignac, 1997).


5.12 Continuity of random variables 

Let X be a random variable on a probability space $(\Omega, \mathcal{A}, P)$. How can
we see from the characteristic function $\phi_X$ whether X is continuous or
not? If it is continuous, how can we deduce from the characteristic function
whether X is absolutely continuous or not? The first question is completely
answered by Wiener's theorem given below. The decision about singular
or absolute continuity is more subtle. There is a necessary condition for
absolute continuity:



Theorem 5.12.1 (Riemann-Lebesgue lemma). If $X \in L^1$, then $\phi_X(n) \to 0$
for $|n| \to \infty$.

Proof. Given $\epsilon > 0$, choose n so large that the n'th Fourier approximation
$X_n(x) = \sum_{k=-n}^{n} \phi_X(k) e^{ikx}$ satisfies $\|X - X_n\|_1 < \epsilon$. For m > n, we have
$\phi_m(X_n) = E[e^{i m X_n}] = 0$ so that

$|\phi_X(m)| = |\phi_{X - X_n}(m)| \le \|X - X_n\|_1 < \epsilon$.



□ 
Remark. The Riemann-Lebesgue lemma can not be reversed. There are
random variables X for which $\phi_X(n) \to 0$, but for which X is not in $L^1$.
Here is an example of a criterion for the characteristic function which
assures that X is absolutely continuous:



Theorem 5.12.2 (Convexity). If $a_n = a_{-n}$ satisfies $a_n \to 0$ for $n \to \infty$ and
$a_{n+1} - 2a_n + a_{n-1} \ge 0$, then there exists a random variable $X \in L^1$ for
which $\phi_X(n) = a_n$.



Proof. We follow [49].

(i) $b_n = a_n - a_{n+1}$ decreases monotonically.
Proof: the convexity condition is equivalent to $a_n - a_{n+1} \le a_{n-1} - a_n$.

(ii) $b_n = a_n - a_{n+1}$ is non-negative for all n.
Proof: $b_n$ decreases monotonically. If some $b_n = c < 0$, then by (i), also
$b_m \le c$ for all $m \ge n$, contradicting the assumption that $b_n \to 0$.

(iii) Also $n b_n$ goes to zero.
Proof: Because $\sum_{k=1}^{n} (a_k - a_{k+1}) = a_1 - a_{n+1}$ is bounded and the summands
are positive, we must have $k(a_k - a_{k+1}) \to 0$.

(iv) $\sum_{k=1}^{n} k(a_{k-1} - 2a_k + a_{k+1}) \to a_0$ for $n \to \infty$.
Proof. This sum simplifies to $a_0 - a_n - n(a_n - a_{n+1})$. By (iii), it goes to
$a_0$ for $n \to \infty$.

(v) The random variable $Y(x) = \sum_{k=1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) K_k(x)$ is in
$L^1$, if $K_k(x)$ is the Fejér kernel with Fourier coefficients $1 - |j|/(k+1)$.
Proof. The Fejér kernel is a positive summability kernel and satisfies

$\|K_k\|_1 = \frac{1}{2\pi} \int_0^{2\pi} K_k(x) \, dx = 1$

for all k. The sum converges by (iv).

(vi) The random variables X and Y have the same characteristic functions.
Proof.

$\phi_Y(n) = \sum_{k=1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) \hat{K}_k(n)$
$= \sum_{k=1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) (1 - \frac{|n|}{k+1})$
$= \sum_{k=n+1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) (1 - \frac{|n|}{k+1}) = a_n$.

□

For bounded random variables, the existence of a discrete component of
the random variable X is decided by the following theorem. It will follow
from corollary (5.12.5) given later on.
Theorem 5.12.3 (Wiener theorem). Given $X \in \mathcal{L}^\infty$ with law $\mu$ supported 
in $[-\pi,\pi]$ and characteristic function $\phi = \phi_X$. Then 

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} |\phi_X(k)|^2 = \sum_{x \in \mathbb{R}} P[X = x]^2 . $$

Therefore, X is continuous if and only if the Wiener averages 
$\frac{1}{n} \sum_{k=1}^{n} |\phi_X(k)|^2$ converge to 0. 
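The Wiener averages can be watched numerically. The following Python sketch assumes a hypothetical law on $[-\pi,\pi]$ (chosen only for this example) with atoms of mass 0.3 at $x = 1$ and 0.2 at $x = -2$ and the remaining mass 0.5 spread uniformly. The uniform part has vanishing Fourier coefficients at nonzero integers, so only the atoms contribute, and the averages should approach $0.3^2 + 0.2^2$.

```python
import cmath

# Hypothetical law on [-pi, pi]: atoms of mass 0.3 at x = 1 and 0.2 at x = -2,
# plus an absolutely continuous part of mass 0.5 with uniform density.
ATOMS = {1.0: 0.3, -2.0: 0.2}


def characteristic_function(k):
    """phi(k) = E[exp(i k X)] for an integer k."""
    if k == 0:
        return 1.0 + 0.0j
    # The uniform part contributes 0.5 * sin(pi k) / (pi k) = 0 for integers k != 0.
    return sum(p * cmath.exp(1j * k * x) for x, p in ATOMS.items())


def wiener_average(n):
    """(1/n) * sum over k = 1..n of |phi(k)|^2."""
    return sum(abs(characteristic_function(k)) ** 2 for k in range(1, n + 1)) / n


expected = sum(p ** 2 for p in ATOMS.values())  # sum of P[X = x]^2 over the atoms
print(wiener_average(10_000), expected)
```

For a law without atoms the same average would tend to zero, which is the continuity criterion stated in the theorem.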



Lemma 5.12.4. If $\mu$ is a measure on the circle T with Fourier coefficients 
$\hat{\mu}_k$, then for every $x \in T$, one has 

$$ \mu(\{x\}) = \lim_{n \to \infty} \frac{1}{2n+1} \sum_{k=-n}^{n} \hat{\mu}_k e^{ikx} . $$



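Proof. We follow [49]. The Dirichlet kernel 

$$ D_n(t) = \sum_{k=-n}^{n} e^{ikt} = \frac{\sin((n+1/2)t)}{\sin(t/2)} $$

satisfies 

$$ D_n \star f(x) = S_n(f)(x) = \sum_{k=-n}^{n} \hat{f}(k) e^{ikx} . $$

The functions 

$$ f_n(t) = \frac{1}{2n+1} D_n(t - x) = \frac{1}{2n+1} \sum_{k=-n}^{n} e^{-ikx} e^{ikt} $$

are bounded by 1 and go to zero uniformly outside any neighborhood of 
t = x. From 

$$ \lim_{\epsilon \to 0} \int_{x-\epsilon}^{x+\epsilon} |d(\mu - \mu(\{x\}) \delta_x)| = 0 $$

follows 

$$ \lim_{n \to \infty} \langle f_n, \mu - \mu(\{x\}) \delta_x \rangle = 0 $$

so that 

$$ \langle f_n, \mu - \mu(\{x\}) \delta_x \rangle = \frac{1}{2n+1} \sum_{k=-n}^{n} \hat{\mu}_k e^{ikx} - \mu(\{x\}) \to 0 . $$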



□ 
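The lemma can be tried out on a measure whose atoms are known. The Python sketch below again assumes a hypothetical measure (atoms of mass 0.3 at $t = 1$ and 0.2 at $t = -2$, the rest uniform on the circle) and uses the convention $\hat{\mu}_k = \int e^{-ikt} \, d\mu(t)$. The averaged sums should approach the point mass at an atom and zero elsewhere.

```python
import cmath

# Hypothetical atoms; the remaining mass 0.5 is uniform and only affects k = 0.
ATOMS = {1.0: 0.3, -2.0: 0.2}


def mu_hat(k):
    """Fourier coefficient: integral of exp(-i k t) with respect to mu."""
    if k == 0:
        return 1.0 + 0.0j
    return sum(p * cmath.exp(-1j * k * t) for t, p in ATOMS.items())


def point_mass_estimate(x, n):
    """(1 / (2n + 1)) * sum over k = -n..n of mu_hat(k) * exp(i k x)."""
    total = sum(mu_hat(k) * cmath.exp(1j * k * x) for k in range(-n, n + 1))
    return (total / (2 * n + 1)).real


for x in (1.0, -2.0, 0.5):
    print(x, point_mass_estimate(x, 5_000))
```

The convergence is slow, with an error of order 1/n, which reflects that the normalized Dirichlet kernel localizes only gradually around t = x.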




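Definition. If $\mu$ and $\nu$ are two measures on $(\Omega = T, \mathcal{A})$, then their convolution 
is defined as 

$$ \mu \star \nu(A) = \int_T \mu(A - x) \, d\nu(x) $$

for any $A \in \mathcal{A}$. Define for a measure on $[-\pi,\pi]$ also $\mu^*(A) = \mu(-A)$. 

Remark. We have $\widehat{\mu^*}(n) = \overline{\hat{\mu}(n)}$ and $\widehat{\mu \star \nu}(n) = \hat{\mu}(n) \hat{\nu}(n)$. If $\mu = \sum_j a_j \delta_{x_j}$ 
is a discrete measure, then $\mu^* = \sum_j a_j \delta_{-x_j}$. Because $\mu \star \mu^*(\{0\}) = \sum_j |a_j|^2$, we 
have in general 

$$ \mu \star \mu^*(\{0\}) = \sum_{x \in T} |\mu(\{x\})|^2 . $$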



Corollary 5.12.5. (Wiener) $\sum_{x \in T} |\mu(\{x\})|^2 = \lim_{n \to \infty} \frac{1}{2n+1} \sum_{k=-n}^{n} |\hat{\mu}_k|^2$. 



Remark. For bounded random variables, we can rescale the random variable 
so that its values lie in $[-\pi,\pi]$ and so that we can use Fourier series 
instead of Fourier integrals. We have also 

$$ \sum_{x \in \mathbb{R}} |\mu(\{x\})|^2 = \lim_{R \to \infty} \frac{1}{2R} \int_{-R}^{R} |\hat{\mu}(t)|^2 \, dt . $$



We turn our attention now to random variables with singular continuous 
distribution. For these random variables, one does have P[X = c] = 0 for 
all c. Furthermore, the distribution function $F_X$ of such a random variable 
X does not have a density. The graph of $F_X$ looks like a devil's staircase. 
Here is a refinement of the notion of continuity for measures. 

Definition. Given a function $h : \mathbb{R} \to [0, \infty)$ satisfying $\lim_{x \to 0} h(x) = 0$. A 
measure $\mu$ on the real line or on the circle is called uniformly h-continuous, 
if there exists a constant C such that for all intervals I = [a, b] on T the 
inequality 

$$ |\mu(I)| \le C h(|I|) $$

holds, where |I| = b - a is the length of I. For $h(x) = x^{\alpha}$ with $0 < \alpha < 1$, 
the measure is called uniformly $\alpha$-continuous. It is then the derivative of an 
$\alpha$-Hölder continuous function. 

Remark. If $\mu$ is the law of a singular continuous random variable X with 
distribution function $F_X$, then $F_X$ is $\alpha$-Hölder continuous if and only if $\mu$ is 
$\alpha$-continuous. For general h, one calls F uniformly lip-h continuous [89]. 
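A standard example is the Cantor measure on [0, 1], whose distribution function is the devil's staircase. The Python sketch below is an illustration under assumptions not taken from the text: it evaluates the Cantor function from the ternary expansion and checks on a grid of intervals that $\mu([a,b]) / (b-a)^{\alpha}$ stays bounded for $\alpha = \log 2 / \log 3$, which is what uniform $\alpha$-continuity asks for. The grid size and function names are choices of this sketch.

```python
from fractions import Fraction
import math

ALPHA = math.log(2) / math.log(3)  # Hoelder exponent assumed for the Cantor measure


def cantor_function(x, depth=60):
    """Distribution function of the Cantor measure on [0, 1], via ternary digits."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, weight = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)          # next ternary digit: 0, 1 or 2
        x -= digit
        if digit == 1:          # x lies in a removed middle third, where F is flat
            return value + weight
        value += weight * (digit // 2)
        weight /= 2
        if x == 0:
            break
    return value


def max_hoelder_ratio(grid=243):
    """Largest value of mu([a, b]) / (b - a)^ALPHA over intervals with grid endpoints."""
    values = [cantor_function(Fraction(i, grid)) for i in range(grid + 1)]
    worst = 0.0
    for i in range(grid):
        for j in range(i + 1, grid + 1):
            mass = values[j] - values[i]            # mu([a, b]); mu has no atoms
            worst = max(worst, mass / ((j - i) / grid) ** ALPHA)
    return worst


print(max_hoelder_ratio())
```

Exact rational arithmetic is used for the ternary digits so that grid points such as 1/3 are classified without rounding problems.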



Theorem 5.12.6 (Y. Last). If there exists C, such that $\frac{1}{n} \sum_{k=1}^{n} |\hat{\mu}_k|^2 < 
C \cdot h(\frac{1}{n})$ for all n > 0, then $\mu$ is uniformly $\sqrt{h}$-continuous. 
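Proof. We follow [58]. The Dirichlet kernel satisfies 

$$ \sum_{k=-n}^{n} |\hat{\mu}_k|^2 = \int_T \int_T D_n(y - x) \, d\mu(x) d\mu(y) $$

and the Fejer kernel $K_n(t)$ satisfies 

$$ K_n(t) = \frac{1}{n+1} \left( \frac{\sin(\frac{n+1}{2} t)}{\sin(t/2)} \right)^2 = \sum_{k=-n}^{n} \left(1 - \frac{|k|}{n+1}\right) e^{ikt} = D_n(t) - \sum_{k=-n}^{n} \frac{|k|}{n+1} e^{ikt} . $$

Therefore 

$$
\begin{aligned}
0 \le \frac{1}{n+1} \sum_{k=-n}^{n} |k| |\hat{\mu}_k|^2 &= \int_T \int_T (D_n(y - x) - K_n(y - x)) \, d\mu(x) d\mu(y) \\
&= \sum_{k=-n}^{n} |\hat{\mu}_k|^2 - \int_T \int_T K_n(y - x) \, d\mu(x) d\mu(y) . \qquad (5.4)
\end{aligned}
$$

Because $\hat{\mu}_n = \overline{\hat{\mu}_{-n}}$, we can also sum from -n to n, changing only the 
constant C. If $\mu$ is not uniformly $\sqrt{h}$-continuous, there exists a sequence 
of intervals $|I_l| \to 0$ with $\mu(I_l) \ge l \sqrt{h(|I_l|)}$. A property of the Fejer kernel 
$K_n(t)$ is that for large enough n, there exists $\delta > 0$ such that $\frac{1}{n} K_n(t) \ge 
\delta > 0$ if $1 \le n|t| \le \pi/2$. Choose $n_l$ so that $1 \le n_l \cdot |I_l| \le \pi/2$. Using 
estimate (5.4), one gets 

$$
\begin{aligned}
\frac{1}{n_l} \sum_{k=-n_l}^{n_l} |\hat{\mu}_k|^2 &\ge \frac{1}{n_l} \int_T \int_T K_{n_l}(y - x) \, d\mu(x) d\mu(y) \\
&\ge \delta \mu(I_l)^2 \ge \delta l^2 h(|I_l|) \\
&\ge C \cdot h(\frac{1}{n_l}) .
\end{aligned}
$$

This contradicts the existence of C such that 

$$ \frac{1}{n} \sum_{k=-n}^{n} |\hat{\mu}_k|^2 \le C h(\frac{1}{n}) . $$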



□ 
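The closed forms of the two kernels used in this proof can be compared with their defining trigonometric sums. The following Python sketch does this at a few sample points; the order n = 7 and the sample points are arbitrary choices made for the illustration.

```python
import cmath
import math


def dirichlet_sum(n, t):
    """D_n(t) as the sum of exp(i k t) over k = -n..n."""
    return sum(cmath.exp(1j * k * t) for k in range(-n, n + 1)).real


def dirichlet_closed(n, t):
    """D_n(t) = sin((n + 1/2) t) / sin(t / 2)."""
    return math.sin((n + 0.5) * t) / math.sin(t / 2)


def fejer_sum(n, t):
    """K_n(t) as the sum of (1 - |k| / (n + 1)) exp(i k t) over k = -n..n."""
    return sum((1 - abs(k) / (n + 1)) * cmath.exp(1j * k * t)
               for k in range(-n, n + 1)).real


def fejer_closed(n, t):
    """K_n(t) = (1 / (n + 1)) * (sin((n + 1) t / 2) / sin(t / 2))^2."""
    return (math.sin((n + 1) * t / 2) / math.sin(t / 2)) ** 2 / (n + 1)


n = 7
for t in (0.3, 1.0, 2.5):
    print(t,
          dirichlet_sum(n, t) - dirichlet_closed(n, t),
          fejer_sum(n, t) - fejer_closed(n, t))
```

Both differences should vanish up to rounding. The closed form also makes the positivity of the Fejer kernel visible, which the Dirichlet kernel lacks.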
Theorem 5.12.7 (Strichartz). Let $\mu$ be a uniformly h-continuous measure 
on the circle. There exists a constant C such that for all n 

$$ \frac{1}{n} \sum_{k=1}^{n} |\hat{\mu}_k|^2 \le C h(\frac{1}{n}) . $$



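Proof. The computation ([106, 107] for the Fourier transform was adapted 
to Fourier series in [52]). In the following computation, we abbreviate $d\mu(x)$ 
with dx: 

$$
\begin{aligned}
\frac{1}{n} \sum_{k=-n}^{n-1} |\hat{\mu}_k|^2
&\le_1 e \sum_{k=-n}^{n-1} \left( \int_0^1 \frac{e^{-(k+\theta)^2/n^2}}{n} \, d\theta \right) |\hat{\mu}_k|^2 \\
&=_2 e \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-(k+\theta)^2/n^2}}{n} \int_{T^2} e^{-i(y-x)k} \, dx \, dy \, d\theta \\
&=_3 e \int_{T^2} \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-(k+\theta)^2/n^2 - i(x-y)k}}{n} \, d\theta \, dx \, dy \\
&=_4 e \int_{T^2} e^{-(x-y)^2 n^2/4 + i(x-y)\theta} \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-\left(\frac{k+\theta}{n} + i(x-y)\frac{n}{2}\right)^2}}{n} \, d\theta \, dx \, dy
\end{aligned}
$$

and continue 

$$
\begin{aligned}
\frac{1}{n} \sum_{k=-n}^{n-1} |\hat{\mu}_k|^2
&\le_5 e \int_{T^2} e^{-(x-y)^2 n^2/4} \left| \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-\left(\frac{k+\theta}{n} + i(x-y)\frac{n}{2}\right)^2}}{n} \, d\theta \right| dx \, dy \\
&=_6 e \int_{T^2} e^{-(x-y)^2 n^2/4} \left[ \int_{-\infty}^{\infty} \frac{e^{-\left(\frac{t}{n} + i(x-y)\frac{n}{2}\right)^2}}{n} \, dt \right] dx \, dy \\
&=_7 e \sqrt{\pi} \int_{T^2} e^{-(x-y)^2 n^2/4} \, dx \, dy \\
&\le_8 e \sqrt{\pi} \left( \int_{T^2} e^{-(x-y)^2 n^2/2} \, dx \, dy \right)^{1/2} \\
&=_9 e \sqrt{\pi} \left( \sum_{k=0}^{\infty} \int_{k/n \le |x-y| \le (k+1)/n} e^{-(x-y)^2 n^2/2} \, dx \, dy \right)^{1/2} \\
&\le_{10} e \sqrt{\pi} \, C_1 h(n^{-1}) \left( \sum_{k=0}^{\infty} e^{-k^2/2} \right)^{1/2} \\
&\le_{11} C h(n^{-1}) .
\end{aligned}
$$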
Here are some remarks about the steps done in this computation: 

(1) is the trivial estimate 

$$ e \int_0^1 e^{-(k+\theta)^2/n^2} \, d\theta \ge 1 \quad \text{for } -n \le k \le n-1 . $$

(2) 

$$ \int_{T^2} e^{-i(y-x)k} \, d\mu(x) d\mu(y) = \int_T e^{-iyk} \, d\mu(y) \int_T e^{ixk} \, d\mu(x) = \hat{\mu}_k \overline{\hat{\mu}_k} = |\hat{\mu}_k|^2 . $$

(3) uses Fubini's theorem. 

(4) is a completion of the square. 

(5) is the Cauchy-Schwarz inequality. 

(6) replaces a sum and the integral $\int_0^1$ by $\int_{-\infty}^{\infty}$. 

(7) uses 

$$ \int_{-\infty}^{\infty} \frac{e^{-\left(\frac{t}{n} + i(x-y)\frac{n}{2}\right)^2}}{n} \, dt = \sqrt{\pi} $$

because 

$$ \int_{-\infty}^{\infty} \frac{e^{-(t/n + b)^2}}{n} \, dt = \sqrt{\pi} $$

for all n and complex b. 

(8) is Jensen's inequality. 

(9) splits the integral into a sum over strips of width 1/n. 

(10) uses the assumption that $\mu$ is h-continuous. 

(11) This step uses that 

$$ \left( \sum_{k=0}^{\infty} e^{-k^2/2} \right)^{1/2} $$

is a constant. □ 
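As a numerical illustration of the decay rate in the Strichartz theorem, the following Python sketch uses the Cantor measure rescaled to a circle of length $2\pi$. This example is an assumption of the sketch and does not appear in the text. For this measure the moduli of the Fourier coefficients are given by the product $\prod_{j \ge 1} |\cos(2\pi k / 3^j)|$, and the measure is uniformly $\alpha$-continuous with $\alpha = \log 2 / \log 3$, so the averages multiplied by $n^{\alpha}$ should stay bounded.

```python
import math

ALPHA = math.log(2) / math.log(3)  # continuity exponent assumed for the Cantor measure


def cantor_coefficient_modulus(k, factors=40):
    """|mu_hat(k)| for the Cantor measure on a circle of length 2 pi (truncated product)."""
    product = 1.0
    for j in range(1, factors + 1):
        product *= abs(math.cos(2 * math.pi * k / 3 ** j))
    return product


def scaled_average(n):
    """n^ALPHA times (1/n) * sum over k = 1..n of |mu_hat(k)|^2."""
    average = sum(cantor_coefficient_modulus(k) ** 2 for k in range(1, n + 1)) / n
    return average * n ** ALPHA


for m in range(1, 7):
    n = 3 ** m
    print(n, scaled_average(n))
```

Note that the individual coefficients do not tend to zero here, since the product is unchanged when k is replaced by 3k. Only the averages decay, which is what the theorem controls.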